Show the code
library(tidyverse)
library(lme4)
library(lmerTest)
library(emmeans)
library(lattice)
library(plotly)
library(censReg)
library(broom)This talk, given in January 2026, focused on data from cohort 4 of AK UNiTE in the summer of 2025.
For the first step in the analysis, I loaded the appropriate packages.
library(tidyverse)
library(lme4)
library(lmerTest)
library(emmeans)
library(lattice)
library(plotly)
library(censReg)
library(broom)I used the Cohort4 data from the Summer 2025 Data Set. Read in the raw data Cohort4 from the data-raw folder into the environment.
I then wanted to review the data to ensure we know which types of data are in each column.
glimpse(cohort4)Rows: 77
Columns: 46
$ StudyNumber <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6…
$ ProgramMonth <chr> "August 2025", "July 2025", "June 2025", "Augu…
$ APPType <chr> "Document the mentoring experience from the me…
$ PreviousPart <chr> "Yes", "Yes", "No", "No", "No", "No", "No", "N…
$ PreviousCohort <chr> "Cohort 3 mentoring participant", "Cohort 3 me…
$ YearsResearchTraining <chr> "less than 1 year", "less than 1 year", "less …
$ ResearchTitle <chr> "Undergraduate researcher", "Undergraduate res…
$ W1_AchieveGoals <chr> "Somewhat agree", "Agree", "Agree", "Agree", "…
$ W1_AccomplishTasks <chr> "Somewhat agree", "Somewhat agree", "Somewhat …
$ W1_ObtainOutcomes <chr> "Agree", "Somewhat agree", "Agree", "Agree", "…
$ W1_Succeed <chr> "Agree", "Agree", "Somewhat agree", "Somewhat …
$ W1_Overcome <chr> "Somewhat agree", "Agree", "Somewhat agree", "…
$ W1_ConfidentTasks <chr> "Somewhat agree", "Somewhat agree", "Somewhat …
$ W1_PeerComparison <chr> "Agree", "Somewhat agree", "Somewhat agree", "…
$ W1_ToughPerformance <chr> "Somewhat agree", "Somewhat agree", "Somewhat …
$ W1_AdaptChange <chr> "Often true", "Often true", "Often true", "Oft…
$ W1_Resilience <chr> "Often true", "Often true", "Often true", "Oft…
$ W1_ResearchExpectations <chr> "In my current research project I am working w…
$ W1_SuccessPotential <chr> "In my undergraduate research project I’ve got…
$ W1_MentorSupport <chr> "The biggest thing mentors can do to support t…
$ W1_OpenResponse <chr> "I had a picnic with my friends.", "I am going…
$ W2_DefineAmbigousR <chr> "A ambiguous situation in research is one that…
$ W2_HandleAmbigousR <chr> "Referring to previous literature is a good pl…
$ W2_MentoringAmbigousR <chr> "Mentors can provide moral support as well as …
$ W2_MentoringNG <chr> "I think not being straightforward is the bigg…
$ W2_OpenResponse <chr> "I had a game night with my best friends", "I …
$ W3_OutcomeEffort <chr> "Proving the hypothesis and/or learning new in…
$ W3_Motivation <chr> "Knowing that the research you are conducting …
$ W3_Support <chr> "My personal values are important in preservin…
$ W3_OpenResponse <chr> "I started a new book!", "I went to the arcade…
$ W4_AchieveGoals <chr> "Agree", "Somewhat agree", "Agree", "Agree", "…
$ W4_AccomplishTasks <chr> "Agree", "Agree", "Somewhat agree", "Agree", "…
$ W4_ObtainOutcomes <chr> "Agree", "Somewhat agree", "Agree", "Agree", "…
$ W4_Succeed <chr> "Agree", "Agree", "Somewhat agree", "Somewhat …
$ W4_Overcome <chr> "Somewhat agree", "Agree", "Agree", "Agree", "…
$ W4_ConfidentTasks <chr> "Somewhat agree", "Somewhat agree", "Agree", "…
$ W4_PeerComparison <chr> "Somewhat agree", "Somewhat agree", "Agree", "…
$ W4_ToughPerformance <chr> "Somewhat agree", "Agree", "Agree", "Somewhat …
$ W4_AdaptChange <chr> "Often true", "Often true", "Often true", "Oft…
$ W4_Resilience <chr> "Sometimes true", "Sometimes true", "Often tru…
$ W4_1Scientist <chr> "Being in the lab for the first time in underg…
$ W4_MeaningfulMentor <chr> "I am fortunate enough to have great mentors f…
$ W4_BelongNG <chr> "I’ve had a couple of experiences where I didn…
$ W4_MentorDescription <chr> "Mentoring has played an important role in my …
$ W4_MentorExpectations <chr> "I expect my mentor to show interest in my goa…
$ W4_OpenResponse <chr> "I learned how to make lasagna!", "I am planni…
Next, I needed to assign numbers to the Likert scale choices for the New General Self-efficacy Scale (6-point) and the Connor-Davidson Resilience Scale (5-point). This code maps strongly agree, agree, etc., to the appropriate numbers.
# Define the Likert scale mappings for reuse
likert_6_map <- c(
"Strongly disagree" = 1, "Disagree" = 2, "Somewhat disagree" = 3,
"Somewhat agree" = 4, "Agree" = 5, "Strongly agree" = 6
)
likert_5_map <- c(
"Never true" = 1, "Rarely true" = 2, "Sometimes true" = 3,
"Often true" = 4, "True nearly all the time" = 5
)I then wanted to create a new data object called cohort4_with_composite that includes composite scores for self-efficacy and resilience. To get a more stable measure of a person’s mindset, the code averages individual questions into a single score for that specific week (Modified New General Self-Efficacy and Conor Davidson Resilience).
Self-Efficacy: Computes the mean of 8 items for Week 1 and Week 4.
Resilience: Computes the mean of the 2 resilience items for Week 1 and Week 4.
The final step calculates Hake’s Normalized Gain. This measures how much a participant improved from Week 1 to Week 4 relative to how much room they had for improvement.
How it handles the “Ceiling Effect”: If a participant started Week 1 with a perfect score (a 6 for Self-Efficacy or a 5 for Resilience), they cannot “improve” on the scale. The code identifies these “Ceiling” cases and manually sets their gain to 0. Without this ifelse logic, the formula would try to divide by zero, resulting in NaN (Not a Number) errors in your dataset.
Temporal Sequencing: Maps the ProgramMonth (June, July, August 2025) to numeric values (1, 2, 3) to track progress over the summer.
Experience Levels: Collapses “Years of Research Training” into a numeric scale (1–6).
Role Assignment: Sorts participants into Mentees or Mentors based on their application type (APPType).
Professional Tiering: Groups various titles into three main buckets: Undergraduates, Graduate Students/Post Docs, and Primary Investigators/Faculty.
View the Cohort4_with_composite data table, which also includes the net gains for each participant for each month. In other words, how much each participant changed just during each month.
glimpse(cohort4_with_composite)Rows: 77
Columns: 56
$ StudyNumber <fct> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5…
$ ProgramMonth <chr> "August 2025", "July 2025", "June 2025", "A…
$ APPType <chr> "Document the mentoring experience from the…
$ PreviousPart <chr> "Yes", "Yes", "No", "No", "No", "No", "No",…
$ PreviousCohort <chr> "Cohort 3 mentoring participant", "Cohort 3…
$ YearsResearchTraining <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2…
$ ResearchTitle <chr> "Undergraduate researcher", "Undergraduate …
$ W1_AchieveGoals <dbl> 4, 5, 5, 5, 4, 4, 5, 5, 5, 5, 6, 5, 6, 5, 5…
$ W1_AccomplishTasks <dbl> 4, 4, 4, 5, 5, 4, 5, 5, 6, 5, 5, 5, 6, 6, 6…
$ W1_ObtainOutcomes <dbl> 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, 6, 6, 6, 5, 5…
$ W1_Succeed <dbl> 5, 5, 4, 4, 3, 4, 6, 5, 4, 5, 6, 5, 5, 5, 4…
$ W1_Overcome <dbl> 4, 5, 4, 5, 6, 5, 6, 5, 5, 5, 5, 5, 6, 6, 6…
$ W1_ConfidentTasks <dbl> 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 5, 6…
$ W1_PeerComparison <dbl> 5, 4, 4, 4, 4, 4, 5, 5, 5, 5, 4, 5, 6, 5, 5…
$ W1_ToughPerformance <dbl> 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 5, 6, 6, 6…
$ W1_AdaptChange <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 4, 5, 5, 5…
$ W1_Resilience <dbl> 4, 4, 4, 4, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5…
$ W1_ResearchExpectations <chr> "In my current research project I am workin…
$ W1_SuccessPotential <chr> "In my undergraduate research project I’ve …
$ W1_MentorSupport <chr> "The biggest thing mentors can do to suppor…
$ W1_OpenResponse <chr> "I had a picnic with my friends.", "I am go…
$ W2_DefineAmbigousR <chr> "A ambiguous situation in research is one t…
$ W2_HandleAmbigousR <chr> "Referring to previous literature is a good…
$ W2_MentoringAmbigousR <chr> "Mentors can provide moral support as well …
$ W2_MentoringNG <chr> "I think not being straightforward is the b…
$ W2_OpenResponse <chr> "I had a game night with my best friends", …
$ W3_OutcomeEffort <chr> "Proving the hypothesis and/or learning new…
$ W3_Motivation <chr> "Knowing that the research you are conducti…
$ W3_Support <chr> "My personal values are important in preser…
$ W3_OpenResponse <chr> "I started a new book!", "I went to the arc…
$ W4_AchieveGoals <dbl> 5, 4, 5, 5, 5, 5, 5, 5, 5, 6, 5, 5, 5, 4, 5…
$ W4_AccomplishTasks <dbl> 5, 5, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6…
$ W4_ObtainOutcomes <dbl> 5, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 4, 4, 5…
$ W4_Succeed <dbl> 5, 5, 4, 4, 3, 4, 6, 6, 5, 6, 5, 6, 4, 5, 5…
$ W4_Overcome <dbl> 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 5, 6, 6, 6, 6…
$ W4_ConfidentTasks <dbl> 4, 4, 5, 5, 4, 5, 5, 5, 6, 6, 6, 5, 5, 6, 6…
$ W4_PeerComparison <dbl> 4, 4, 5, 5, 4, 4, 5, 5, 5, 6, 6, 4, 5, 6, 5…
$ W4_ToughPerformance <dbl> 4, 5, 5, 4, 5, 4, 5, 5, 5, 6, 6, 5, 6, 6, 6…
$ W4_AdaptChange <dbl> 4, 4, 4, 4, 5, 5, 4, 4, 4, 5, 5, 3, 5, 5, 5…
$ W4_Resilience <dbl> 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5…
$ W4_1Scientist <chr> "Being in the lab for the first time in und…
$ W4_MeaningfulMentor <chr> "I am fortunate enough to have great mentor…
$ W4_BelongNG <chr> "I’ve had a couple of experiences where I d…
$ W4_MentorDescription <chr> "Mentoring has played an important role in …
$ W4_MentorExpectations <chr> "I expect my mentor to show interest in my …
$ W4_OpenResponse <chr> "I learned how to make lasagna!", "I am pla…
$ Composite_Self_Efficacy_W1 <dbl> 4.375, 4.375, 4.250, 4.750, 4.625, 4.500, 5…
$ Composite_Self_Efficacy_W4 <dbl> 4.500, 4.500, 4.750, 4.750, 4.500, 4.625, 5…
$ Composite_Resilience_W1 <dbl> 4.0, 4.0, 4.0, 4.0, 3.5, 3.5, 4.0, 4.0, 4.0…
$ Composite_Resilience_W4 <dbl> 3.5, 3.5, 4.0, 4.0, 4.5, 4.5, 4.0, 4.0, 4.0…
$ Month_Number <dbl> 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1…
$ Years_Factor <fct> 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2…
$ Mentoring_Category <fct> mentee, mentee, mentee, mentee, mentee, men…
$ Research_Category <chr> "Undergraduates", "Undergraduates", "Underg…
$ g_SelfEfficacy <dbl> 0.07692308, 0.07692308, 0.28571429, 0.00000…
$ g_Resilience <dbl> -0.5000000, -0.5000000, 0.0000000, 0.000000…
There are still a few more steps to prepare the data for the analyses we want to perform. To run the linear mixed-effects model, we pivot the columns to make the data long rather than wide. We also ensure that pre comes before post and that both the resilience composite score and the self-efficacy composite score are numeric variables.
Since we are also interested in whether the length of participation influenced gains in self-efficacy and resilience, we calculate this for each participant.
#--------------Step 2-------------
# Define the composite score columns to pivot
score_cols_to_pivot <- c(
"Composite_Self_Efficacy_W1", "Composite_Self_Efficacy_W4",
"Composite_Resilience_W1", "Composite_Resilience_W4"
)
cohort4_long_final <- cohort4_with_composite %>%
# Pivot the 4 composite columns longer
pivot_longer(
cols = all_of(score_cols_to_pivot),
names_to = c(".value", "TimePoint"),
names_sep = "_W",
names_prefix = "Composite_"
) %>%
# === Final LMEM Variable Setup ===
mutate(
# Create the key Pre/Post factor with correct levels
Measure_Type = factor(
ifelse(TimePoint == "1", "Pre", "Post"),
levels = c("Pre", "Post")
),
# Ensure scores are numeric
Resilience = as.numeric(Resilience),
Self_Efficacy = as.numeric(Self_Efficacy)
) %>%
# Calculate Length of Participation (count of Pre/Post measures)
group_by(StudyNumber) %>%
mutate(
LengthParticipation = n()
) %>%
ungroup() %>%
# Remove NAs and unnecessary columns
filter(!is.na(Self_Efficacy) | !is.na(Resilience)) %>%
select(-TimePoint)The way these data are organized allows us to see whether net gains changed over one month or from the beginning to the end of the period during which a participant was engaged in the reflections.
Others have found that self-efficacy and resilience are correlated Abdolrezapour, Ganjeh, and Ghanbari (2023), so we wanted to analyze our participants to see whether we also find a similar correlation.
At this point, I exported the newly created data table to an R script named cor_self_and_res.
For this first correlation between self-efficacy and resilience, we are using each participant’s pre-score regardless of when they started. This first segment of code creates an object/data frame that filters the necessary comparisons.
first_baseline_correlation <- cohort4_long_final %>%
# 1. Filter only for baseline entries
filter(Measure_Type == "Pre") %>%
# 2. Group by participant
group_by(StudyNumber) %>%
# 3. Select the row with the smallest Month_Number (the earliest month)
# If you have month names, ensure Month_Number is numeric (1=June, 2=July, etc.)
slice_min(order_by = Month_Number, n = 1, with_ties = FALSE) %>%
# 4. Cleanup and Grouping
ungroup() %>%
select(StudyNumber, Self_Efficacy, Resilience, Mentoring_Category, Month_Number)Now it is time to run the correlation and visualize it.
# Calculate the correlation coefficient
cor_val <- cor.test(first_baseline_correlation$Self_Efficacy,
first_baseline_correlation$Resilience,
method = "pearson")
print(cor_val)
Pearson's product-moment correlation
data: first_baseline_correlation$Self_Efficacy and first_baseline_correlation$Resilience
t = 4.5648, df = 38, p-value = 5.118e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3481325 0.7648295
sample estimates:
cor
0.5951048
A Pearson correlation coefficient was computed to assess the linear relationship between self-efficacy and resilience using the first time they completed the instrument. There was a positive correlation between the two variables, r(38) = .595, p <.001.
ggplot(first_baseline_correlation, aes(x = Self_Efficacy, y = Resilience)) +
# 1. Add jittered points to see density of Likert responses
geom_jitter(width = 0.1, height = 0.1, alpha = 0.5, size = 2.5, color = "#2c3e50") +
# 2. Add the Pearson regression line with Confidence Intervals (se = TRUE)
geom_smooth(method = "lm", se = TRUE, fullrange = TRUE, color = "black", fill = "#bdc3c7", alpha = 0.4) +
# 3. Standardize the axes to your survey scales
scale_x_continuous(breaks = seq(1, 6, 0.5)) +
scale_y_continuous(breaks = seq(1, 5, 0.5)) +
# 4. Styling and Labels
theme_minimal(base_size = 14) +
labs(
title = "Baseline Correlation: Self-Efficacy vs. Resilience",
subtitle = paste("Pearson's r =", round(cor_val$estimate, 2), "| Shaded area represents 95% Confidence Interval"),
x = "Baseline Self-Efficacy (1-6 Scale)",
y = "Baseline Resilience (1-5 Scale)",
caption = "Data: First entry 'Pre' scores only"
)
#export figure
ggsave(
filename = "figs/BaselineCorreleation.svg",
device = "svg",
width = 8, # Width in inches
height = 6, # Height in inches
units = "in"
)first_baseline_correlation %>%
count(Month_Number, name = "Frequency") %>%
rename("Program Month" = Month_Number) %>%
knitr::kable()| Program Month | Frequency |
|---|---|
| 1 | 13 |
| 2 | 15 |
| 3 | 12 |
Because some participants are in the program for one month and others for three, you want to capture their growth from the moment they entered (True Baseline) to the moment they left (Final Exit).
\[ g = \frac{\text{Last Post} - \text{First Pre}}{\text{Max Possible Score} - \text{First Pre}}\]What it represents: It tells us what percentage of the “available room for growth” the participant actually covered. A student who starts at a 5 and moves to a 6 has a “perfect” gain (1.0), just like a student moving from a 3 to a 6.
cohort4_final_gains <- cohort4_long_final %>%
group_by(StudyNumber) %>%
summarize(
# 1. Participation Length
Months_Participated = n_distinct(Month_Number),
# 2. Self-Efficacy: First Pre and Last Post
First_Pre_SE = Self_Efficacy[Measure_Type == "Pre" & Month_Number == min(Month_Number)][1],
Last_Post_SE = Self_Efficacy[Measure_Type == "Post" & Month_Number == max(Month_Number)][1],
# 3. Resilience: First Pre and Last Post
First_Pre_Res = Resilience[Measure_Type == "Pre" & Month_Number == min(Month_Number)][1],
Last_Post_Res = Resilience[Measure_Type == "Post" & Month_Number == max(Month_Number)][1],
# Keep metadata
Mentoring_Category = first(Mentoring_Category),
Research_Category = first(Research_Category),
.groups = "drop"
) %>%
# 4. Calculate Normalized Gains (Hake's g)
mutate(
# Gain for Self-Efficacy (1-6 scale)
Gain_SelfEfficacy = (Last_Post_SE - First_Pre_SE) / (6 - First_Pre_SE),
Gain_SelfEfficacy = ifelse(First_Pre_SE == 6, 0, Gain_SelfEfficacy),
# Gain for Resilience (1-5 scale)
Gain_Resilience = (Last_Post_Res - First_Pre_Res) / (5 - First_Pre_Res),
Gain_Resilience = ifelse(First_Pre_Res == 5, 0, Gain_Resilience)
)
glimpse(cohort4_final_gains)Rows: 40
Columns: 10
$ StudyNumber <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ Months_Participated <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2,…
$ First_Pre_SE <dbl> 4.250, 4.500, 5.000, 5.250, 5.375, 5.250, 5.500, 4…
$ Last_Post_SE <dbl> 4.500, 4.750, 5.125, 6.000, 5.125, 5.000, 5.000, 5…
$ First_Pre_Res <dbl> 4.0, 3.5, 4.0, 4.5, 5.0, 4.0, 5.0, 3.5, 4.0, 5.0, …
$ Last_Post_Res <dbl> 3.5, 4.0, 4.0, 5.0, 5.0, 3.5, 4.0, 3.5, 4.0, 5.0, …
$ Mentoring_Category <fct> mentee, mentee, mentee, mentee, mentee, mentee, me…
$ Research_Category <chr> "Undergraduates", "Undergraduates", "Undergraduate…
$ Gain_SelfEfficacy <dbl> 0.1428571, 0.1666667, 0.1250000, 1.0000000, -0.400…
$ Gain_Resilience <dbl> -0.5000000, 0.3333333, 0.0000000, 1.0000000, 0.000…
This cohort4_final_gains data frame allows us to ask, “Does a participant who participated for three months have a greater gain than someone who participated for one month?” To ensure that we do the right kind of statistical test, we first want to see if our data are normally distributed
# 1. Visual Check: Histogram of Self Efficacy
ggplot(cohort4_final_gains, aes(x = Gain_SelfEfficacy)) +
geom_histogram(fill = "#7b178e", color = "white", bins = 15) +
theme_minimal() +
labs(title = "Distribution of Normalized Gains for Self-efficacy")# 1. Visual Check: Histogram of Resilience
ggplot(cohort4_final_gains, aes(x = Gain_Resilience)) +
geom_histogram(fill = "#7b178e", color = "white", bins = 15) +
theme_minimal() +
labs(title = "Distribution of Normalized Gains for Resilience")# 2. Statistical Check: Shapiro-Wilk Test
# p > 0.05 means your data is likely Normal.
shapiro.test(cohort4_final_gains$Gain_SelfEfficacy)
Shapiro-Wilk normality test
data: cohort4_final_gains$Gain_SelfEfficacy
W = 0.95083, p-value = 0.08096
#shapiro wilk for resilience
shapiro.test(cohort4_final_gains$Gain_Resilience)
Shapiro-Wilk normality test
data: cohort4_final_gains$Gain_Resilience
W = 0.80281, p-value = 7.758e-06
Since the Shapiro test p-value is greater than .05 for self-efficacy, we can use parametric statistics. But it is less that .05 for Resilience so a Kruskal-Wallis test is more appropriate.
# Run ANOVA
anova_final_gains <- aov(Gain_SelfEfficacy ~ as.factor(Months_Participated), data = cohort4_final_gains)
# View results
summary(anova_final_gains) Df Sum Sq Mean Sq F value Pr(>F)
as.factor(Months_Participated) 2 0.123 0.06173 0.213 0.809
Residuals 37 10.701 0.28920
# 1. Ensure Months_Participated is a factor so it creates separate boxes
cohort4_final_gains <- cohort4_final_gains %>%
mutate(Months_Factor = as.factor(Months_Participated))
# 2. Create the Plot
ggplot(cohort4_final_gains, aes(x = Months_Factor, y = Gain_SelfEfficacy, fill = Months_Factor)) +
# Add the boxplot for the summary statistics
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
# Add each StudyNumber as a jittered dot
# width = 0.2 prevents the dots from spreading too far horizontally
geom_jitter(width = 0.2, size = 2, alpha = 0.7, color = "#2c3e50") +
# Styling
theme_minimal(base_size = 14) +
scale_fill_brewer(palette = "Greens") + # Subtle green gradient for time
labs(
title = "Self-Efficacy Growth by Program Duration",
subtitle = "Normalized Gain (First Pre to Last Post)",
x = "Number of Months Participated",
y = "Normalized Gain (Hake's g)",
caption = "Each dot represents one participant (StudyNumber)"
) +
theme(legend.position = "none") # Legend is redundant since x-axis is labeled
#export figure
ggsave(
filename = "figs/SelfEfficacy_Gains.svg",
device = "svg",
width = 8, # Width in inches
height = 6, # Height in inches
units = "in"
)A one-way ANOVA was conducted to compare the effect of participation length on normalized self-efficacy gains. The results indicated that there were no significant differences between those who participated for 1, 2, or 3 months, \(F(2, 37) = 0.21, p = .809\).
# Filter and test for each month group
t_test_results <- cohort4_final_gains %>%
group_by(Months_Participated) %>%
summarise(
Mean_Gain = mean(Gain_SelfEfficacy, na.rm = TRUE),
P_Value = t.test(Gain_SelfEfficacy, mu = 0)$p.value
)
print(t_test_results)# A tibble: 3 × 3
Months_Participated Mean_Gain P_Value
<int> <dbl> <dbl>
1 1 -0.00825 0.953
2 2 -0.0476 0.732
3 3 0.0929 0.612
resilience_kw <- kruskal.test(Gain_Resilience ~ as.factor(Months_Participated),
data = cohort4_final_gains)
# View the results
print(resilience_kw)
Kruskal-Wallis rank sum test
data: Gain_Resilience by as.factor(Months_Participated)
Kruskal-Wallis chi-squared = 0.52091, df = 2, p-value = 0.7707
A Kruskal-Wallis test showed that normalized gains in resilience did not significantly differ by length of participation, \(\chi^2(2) = .521, p = .775\).
ggplot(cohort4_final_gains, aes(x = Months_Factor, y = Gain_Resilience)) +
# 1. Add the boxplot with the Purples palette
geom_boxplot(aes(fill = Months_Factor), alpha = 0.5, outlier.shape = NA) +
scale_fill_brewer(palette = "Purples") + # Added missing + here
# 2. Add StudyNumbers as dots, colored by the Ceiling condition
geom_jitter(aes(color = (First_Pre_Res >= 5 | Last_Post_Res >= 5)),
width = 0.2, size = 2.5, alpha = 0.8) +
# 3. Apply your specific purple colors
scale_color_manual(values = c("#987eab", "#5e17eb"),
labels = c("Below Ceiling", "At Ceiling")) +
# 4. Styling, Labels, and REMOVING LEGEND
theme_minimal(base_size = 15) +
theme(legend.position = "none") + # This hides all legends
labs(title = "Resilience Normalized Gains by Participation Length",
subtitle = "Darker purple dots indicate participants at the scale ceiling (5.0)",
x = "Months of Participation",
y = "Normalized Gain (Resilience)")
#export figure
ggsave(
filename = "figs/Resilience_Gains.svg",
device = "svg",
width = 8, # Width in inches
height = 6, # Height in inches
units = "in"
)While we did not find that the normalized gains in resilience differed by length of time spent with us, we also wanted to know if there might have been significant gains over the baseline. So we ran a Wilcoxon signed-rank test.
# Non-parametric test for resilience
wilcox_results <- cohort4_final_gains %>%
group_by(Months_Participated) %>%
summarise(
Median_Gain = median(Gain_Resilience, na.rm = TRUE),
P_Value = wilcox.test(Gain_Resilience, mu = 0)$p.value
)
print(wilcox_results)# A tibble: 3 × 3
Months_Participated Median_Gain P_Value
<int> <dbl> <dbl>
1 1 0 0.850
2 2 0 0.387
3 3 0 0.586
Since the baseline remained unchanged regardless of how long participants engaged in the program, we thought it would be interesting to examine some individual trajectories. The code below creates an interactive chart showing the actual score for each and every time a participant answered the survey questions.
# Create a chronological ordering factor
indiv_traject_data <- cohort4_long_final %>%
mutate(
# Create a label like "M1-Pre", "M1-Post", etc.
Time_Label = paste0("M", Month_Number, "-", Measure_Type),
# Create a numeric sequence for the x-axis
# (Month 1 Pre = 1, Month 1 Post = 2, Month 2 Pre = 3, etc.)
Timeline = (Month_Number - 1) * 2 + ifelse(Measure_Type == "Pre", 1, 2)
) %>%
mutate(Time_Label = reorder(Time_Label, Timeline)) # Force R to keep them in order
# Create a list of IDs who stayed for 3 months
completers_3mo <- cohort4_long_final %>%
group_by(StudyNumber) %>%
filter(max(Month_Number) == 3) %>%
pull(StudyNumber) %>%
unique()
indiv_traject_data <- indiv_traject_data %>%
filter(StudyNumber %in% completers_3mo)
# 1. Create a "Shared Data" object
shared_data <- highlight_key(indiv_traject_data, ~StudyNumber)
# 2. Build the plot using the shared data
p_interactive <- ggplot(shared_data, aes(x = Timeline, y = Self_Efficacy, group = StudyNumber)) +
geom_line(color = "steelblue", alpha = 0.3) +
geom_point(alpha = 0.5) +
scale_x_continuous(breaks = 1:6,
labels = c("M1-Pre", "M1-Post", "M2-Pre", "M2-Post", "M3-Pre", "M3-Post")) +
theme_minimal()
# 3. Use ggplotly with 'highlight' options
ggplotly(p_interactive) %>%
highlight(
on = "plotly_hover", # Line lights up when you hover
color = "#45185e", # The color it turns when highlighted
selectize = TRUE, # Adds a dropdown menu to select a specific StudyNumber
persistent = FALSE
)Who falls into a “high”, “medium”, and “low” self-efficacy categories?
# 1. Calculate the Global Mean and Standard Deviation for all rows
# We assume your column is named 'Composite_Self_Efficacy'
global_stats <- cohort4_long_final %>%
summarise(
Mean_SE = mean(Self_Efficacy, na.rm = TRUE),
SD_SE = sd(Self_Efficacy, na.rm = TRUE)
)
# Save these as variables for the thresholds
m <- global_stats$Mean_SE
s <- global_stats$SD_SE
high_thresh <- m + s
low_thresh <- m - s
# 2. Identify participants based on their average score
# We group by StudyNumber so each person has one "dot" on the plot
participant_summary <- cohort4_long_final %>%
group_by(StudyNumber, Research_Category) %>%
summarise(Avg_SE = mean(Self_Efficacy, na.rm = TRUE), .groups = "drop") %>%
mutate(
Outlier_Status = case_when(
Avg_SE >= high_thresh ~ "High (>1 SD)",
Avg_SE <= low_thresh ~ "Low (<1 SD)",
TRUE ~ "Within 1 SD"
)
)
# Print the outliers to the console
print("Participants falling 1 SD or more above the mean:")[1] "Participants falling 1 SD or more above the mean:"
print(participant_summary %>% filter(Outlier_Status == "High (>1 SD)"))# A tibble: 5 × 4
StudyNumber Research_Category Avg_SE Outlier_Status
<fct> <chr> <dbl> <chr>
1 12 Undergraduates 5.88 High (>1 SD)
2 25 Undergraduates 5.97 High (>1 SD)
3 27 Undergraduates 5.66 High (>1 SD)
4 33 Graduate Students 5.75 High (>1 SD)
5 36 Undergraduates 5.81 High (>1 SD)
print("Participants falling 1 SD or more below the mean:")[1] "Participants falling 1 SD or more below the mean:"
print(participant_summary %>% filter(Outlier_Status == "Low (<1 SD)"))# A tibble: 7 × 4
StudyNumber Research_Category Avg_SE Outlier_Status
<fct> <chr> <dbl> <chr>
1 1 Undergraduates 4.46 Low (<1 SD)
2 2 Undergraduates 4.62 Low (<1 SD)
3 9 Graduate Students 4.21 Low (<1 SD)
4 19 Undergraduates 4.62 Low (<1 SD)
5 23 Undergraduates 4.62 Low (<1 SD)
6 28 Graduate Students 4.47 Low (<1 SD)
7 40 Graduate Students 4.56 Low (<1 SD)
# 1. SET THE ORDER: Reorder Research_Category as a factor
participant_summary$Research_Category <- factor(
participant_summary$Research_Category,
levels = c("Undergraduates", "Graduate Students", "Primary Investigators")
)
# 3. Graph the results
ggplot(participant_summary, aes(x = Research_Category, y = Avg_SE, color = Research_Category)) +
# Add the Mean and SD threshold lines
geom_hline(yintercept = m, color = "black", size = 1) +
geom_hline(yintercept = c(high_thresh, low_thresh), linetype = "dashed", color = "#45185e") +
# Add the participants as dots
geom_jitter(width = 0.2, size = 3, alpha = 0.8) +
scale_color_manual(values = c(
"Undergraduates" = "#1b9e77",
"Graduate Students" = "#d95f02",
"Primary Investigators" = "#7570b3"
)) +
# Label the dots with StudyNumber for those outside 1 SD
geom_text(aes(label = ifelse(Outlier_Status != "Within 1 SD", StudyNumber, "")),
vjust = -1, size = 3, fontface = "bold", show.legend = FALSE) +
# Professional Formatting
theme_classic(base_size = 14) +
labs(
title = "Self-Efficacy Distribution by Research Category",
subtitle = paste0("Solid line: Mean (", round(m, 2), ") | Dashed lines: ±1 Standard Deviation"),
y = "Average Self-Efficacy Score",
x = "Research Category",
caption = "Study Numbers are labeled for participants >1 SD from the mean."
) +
theme(legend.position = "none")ggsave(
filename = "figs/ResearchType_ActualScores.svg",
device = "svg",
width = 8, # Width in inches
height = 6, # Height in inches
units = "in"
)We first need to add the Years_Factor of years in research training to our data frame
# 1. Join only the Years_Factor column into your gains data frame
cohort4_final_gains <- cohort4_final_gains %>%
left_join(cohort4_with_composite %>%
dplyr::select(StudyNumber, Years_Factor), # Explicitly tell R to use dplyr
by = "StudyNumber")ggplot(cohort4_final_gains, aes(x = Years_Factor, y = Gain_SelfEfficacy)) +
# Boxplot with Green palette
geom_boxplot(aes(fill = Years_Factor), alpha = 0.5, outlier.shape = NA) +
scale_fill_brewer(palette = "Greens") +
# Add individual points
geom_jitter(width = 0.2, size = 2.5, alpha = 0.7, color = "#2c3e50") +
theme_minimal(base_size = 15) +
theme(legend.position = "none") +
labs(title = "Self-Efficacy Gains by Years of Experience",
x = "Years of Experience (Category)",
y = "Normalized Gain (Self-Efficacy)")#export figure
ggsave(
filename = "figs/SE_Gains_YrsExp.svg",
device = "svg",
width = 8, # Width in inches
height = 6, # Height in inches
units = "in"
)ggplot(cohort4_final_gains, aes(x = Years_Factor, y = Gain_Resilience)) +
# Boxplot with Purple palette
geom_boxplot(aes(fill = Years_Factor), alpha = 0.5, outlier.shape = NA) +
scale_fill_brewer(palette = "Purples") +
# Add individual points, highlighting the ceiling group in dark purple
geom_jitter(aes(color = (First_Pre_Res >= 5 | Last_Post_Res >= 5)),
width = 0.2, size = 2.5, alpha = 0.8) +
scale_color_manual(values = c("black", "#7b178e")) +
theme_minimal(base_size = 15) +
theme(legend.position = "none") +
labs(title = "Resilience Gains by Years of Experience",
subtitle = "Darker purple dots indicate participants at the scale ceiling",
x = "Years of Experience (Category)",
y = "Normalized Gain (Resilience)")#export figure
ggsave(
filename = "figs/Res_Gains_YrsExp.svg",
device = "svg",
width = 8, # Width in inches
height = 6, # Height in inches
units = "in"
)# Tobit for Self-Efficacy (Ceiling = 6)
tobit_se_years <- censReg(Last_Post_SE ~ First_Pre_SE + Years_Factor,
right = 6,
data = cohort4_final_gains)
summary(tobit_se_years)
Call:
censReg(formula = Last_Post_SE ~ First_Pre_SE + Years_Factor,
right = 6, data = cohort4_final_gains)
Observations:
Total Left-censored Uncensored Right-censored
77 0 72 5
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) 2.28680 0.41257 5.543 2.98e-08 ***
First_Pre_SE 0.60358 0.07943 7.599 2.98e-14 ***
Years_Factor2 -0.13485 0.10690 -1.261 0.20714
Years_Factor3 -0.36349 0.13236 -2.746 0.00603 **
Years_Factor4 -0.21031 0.22713 -0.926 0.35447
Years_Factor5 -0.49220 0.25751 -1.911 0.05596 .
Years_Factor6 -0.07648 0.11558 -0.662 0.50817
logSigma -1.06135 0.08488 -12.504 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Newton-Raphson maximisation, 6 iterations
Return code 8: successive function values within relative tolerance limit (reltol)
Log-likelihood: -31.91955 on 8 Df
group_summary <- cohort4_final_gains %>%
group_by(Years_Factor) %>%
summarise(
n = n(),
mean_pre = mean(First_Pre_SE, na.rm = TRUE),
mean_post = mean(Last_Post_SE, na.rm = TRUE),
mean_gain = mean(Gain_SelfEfficacy, na.rm = TRUE)
) %>%
mutate(across(where(is.numeric), round, 2))
print(group_summary)# A tibble: 6 × 5
Years_Factor n mean_pre mean_post mean_gain
<fct> <dbl> <dbl> <dbl> <dbl>
1 1 19 5.1 5.35 0.25
2 2 24 5.05 5.19 0.05
3 3 11 5.22 5.06 -0.23
4 4 3 5.71 5.46 -0.53
5 5 2 5 4.81 -0.54
6 6 18 4.88 5.15 0.06
tobit_res_years <- censReg(Last_Post_Res ~ First_Pre_Res + Years_Factor,
right = 5,
data = cohort4_final_gains)
summary(tobit_res_years)
Call:
censReg(formula = Last_Post_Res ~ First_Pre_Res + Years_Factor,
right = 5, data = cohort4_final_gains)
Observations:
Total Left-censored Uncensored Right-censored
77 0 46 31
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) -0.74299 0.78121 -0.951 0.342
First_Pre_Res 1.23420 0.18309 6.741 1.57e-11 ***
Years_Factor2 -0.04031 0.20612 -0.196 0.845
Years_Factor3 -0.00807 0.26734 -0.030 0.976
Years_Factor4 -0.09393 0.46368 -0.203 0.839
Years_Factor5 0.22080 0.48602 0.454 0.650
Years_Factor6 -0.04283 0.21849 -0.196 0.845
logSigma -0.48546 0.11425 -4.249 2.15e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Newton-Raphson maximisation, 7 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -63.05633 on 8 Df
# Pulling the results into a data frame
tobit_coeffs <- as.data.frame(summary(tobit_res_years)$estimate)
tobit_coeffs$Variable <- rownames(tobit_coeffs)
# Filtering for just the Years_Factor rows
tobit_plot_data <- tobit_coeffs %>%
filter(grepl("Years_Factor", Variable))
ggplot(tobit_plot_data, aes(x = Variable, y = Estimate)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") + # Zero line
geom_point(size = 4, color = "#2c3e50") +
geom_errorbar(aes(ymin = Estimate - `Std. error`, ymax = Estimate + `Std. error`), width = 0.2) +
theme_minimal() +
coord_flip() + # Makes it easier to read variable names
labs(title = "Tobit Coefficients: Effect of Years of Experience",
subtitle = "Reference Group: Years_Factor 1 | Error bars represent Std. Error",
y = "Coefficient Estimate (Relative to Group 1)",
x = "Experience Group")# Pulling the results into a data frame
tobit_coeffs_se <- as.data.frame(summary(tobit_se_years)$estimate)
tobit_coeffs_se$Variable <- rownames(tobit_coeffs_se)
# Filtering for just the Years_Factor rows
tobit_seplot_data <- tobit_coeffs_se %>%
filter(grepl("Years_Factor", Variable))
ggplot(tobit_seplot_data, aes(x = Variable, y = Estimate)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") + # Zero line
geom_point(size = 4, color = "#2c3e50") +
geom_errorbar(aes(ymin = Estimate - `Std. error`, ymax = Estimate + `Std. error`), width = 0.2) +
theme_minimal() +
coord_flip() + # Makes it easier to read variable names
labs(title = "Tobit Coefficients: Effect of Years of Experience on Self-Efficacy",
subtitle = "Reference Group: Years_Factor 1 | Error bars represent Std. Error",
y = "Coefficient Estimate (Relative to Group 1)",
x = "Experience Group")# For Self-Efficacy
se_tests <- cohort4_final_gains %>%
group_by(Years_Factor) %>%
summarise(
n = n(),
mean_gain = mean(Gain_SelfEfficacy, na.rm = TRUE),
p_value = t.test(Gain_SelfEfficacy, mu = 0)$p.value
)
summary(se_tests) Years_Factor n mean_gain p_value
1:1 Min. : 2.00 Min. :-0.54167 Min. :0.03131
2:1 1st Qu.: 5.00 1st Qu.:-0.45395 1st Qu.:0.10306
3:1 Median :14.50 Median :-0.08952 Median :0.43888
4:1 Mean :12.83 Mean :-0.15590 Mean :0.37040
5:1 3rd Qu.:18.75 3rd Qu.: 0.05815 3rd Qu.:0.60277
6:1 Max. :24.00 Max. : 0.25338 Max. :0.66479
# For Resilience
resilience_tests <- cohort4_final_gains %>%
group_by(Years_Factor) %>%
summarise(
n = n(),
mean_gain = mean(Gain_Resilience, na.rm = TRUE),
p_value = t.test(Gain_Resilience, mu = 0)$p.value # Testing if mean is different from 0
)
summary(resilience_tests) Years_Factor n mean_gain p_value
1:1 Min. : 2.00 Min. :-0.16667 Min. :0.1227
2:1 1st Qu.: 5.00 1st Qu.: 0.03289 1st Qu.:0.1764
3:1 Median :14.50 Median : 0.13871 Median :0.3137
4:1 Mean :12.83 Mean : 0.09043 Mean :0.4520
5:1 3rd Qu.:18.75 3rd Qu.: 0.17282 3rd Qu.:0.7020
6:1 Max. :24.00 Max. : 0.25000 Max. :1.0000