Latent Profile Analysis (LPA) is a person-centered statistical approach that identifies unobserved subgroups (profiles) within a population based on patterns of responses across multiple continuous variables. Unlike variable-centered approaches that examine relationships between variables, LPA focuses on identifying types or classes of individuals who share similar response patterns.
library(tidyLPA) # Main LPA package
library(tidyverse) # Data manipulation and visualization
library(knitr) # For table formatting
library(kableExtra) # Enhanced table formatting
library(corrplot) # Correlation plots
library(psych) # Descriptive statistics
library(gt)
Package Functions: - tidyLPA
: Provides
user-friendly LPA functions with multiple model parameterizations -
tidyverse
: Data manipulation, visualization, and analysis -
knitr/kableExtra
: Professional table formatting -
corrplot
: Visualization of correlation matrices -
psych
: Descriptive statistics and psychometric
functions
Simulated Student Academic Performance Data
For this demonstration, we’ll use simulated student performance data with an adequate sample size for LPA. This dataset represents student scores across multiple academic domains.
Note on Sample Size: LPA requires adequate sample sizes - typically 50-100 cases per expected profile. We’ll simulate 300 students to ensure robust results.
# Set seed for reproducibility
set.seed(123)
# Simulate student performance data with three underlying profiles
n_total <- 300
# Profile 1: High Achievers (35% of sample)
n1 <- 105
high_achievers <- data.frame(
student_id = 1:n1,
math_score = rnorm(n1, mean = 85, sd = 8),
reading_score = rnorm(n1, mean = 88, sd = 7),
science_score = rnorm(n1, mean = 82, sd = 9),
writing_score = rnorm(n1, mean = 86, sd = 8),
critical_thinking = rnorm(n1, mean = 80, sd = 10),
true_profile = "High Achiever"
)
# Profile 2: Average Performers (45% of sample)
n2 <- 135
average_performers <- data.frame(
student_id = (n1+1):(n1+n2),
math_score = rnorm(n2, mean = 65, sd = 12),
reading_score = rnorm(n2, mean = 68, sd = 11),
science_score = rnorm(n2, mean = 62, sd = 13),
writing_score = rnorm(n2, mean = 66, sd = 12),
critical_thinking = rnorm(n2, mean = 60, sd = 14),
true_profile = "Average Performer"
)
# Profile 3: Struggling Students (20% of sample)
n3 <- 60
struggling_students <- data.frame(
student_id = (n1+n2+1):n_total,
math_score = rnorm(n3, mean = 45, sd = 15),
reading_score = rnorm(n3, mean = 48, sd = 14),
science_score = rnorm(n3, mean = 42, sd = 16),
writing_score = rnorm(n3, mean = 46, sd = 15),
critical_thinking = rnorm(n3, mean = 40, sd = 18),
true_profile = "Struggling Student"
)
# Combine all profiles
lpa_data <- rbind(high_achievers, average_performers, struggling_students) %>%
# Ensure scores are within realistic bounds (0-100)
mutate(
math_score = pmax(0, pmin(100, math_score)),
reading_score = pmax(0, pmin(100, reading_score)),
science_score = pmax(0, pmin(100, science_score)),
writing_score = pmax(0, pmin(100, writing_score)),
critical_thinking = pmax(0, pmin(100, critical_thinking))
) %>%
# Shuffle rows to mix profiles
sample_n(n_total)
# Basic information about the dataset
cat("Dataset dimensions:", dim(lpa_data), "\n")
Dataset dimensions: 300 7
Number of students: 300
Number of variables: 5 academic scores
# Variable descriptions
var_descriptions <- data.frame(
Variable = c("math_score", "reading_score", "science_score", "writing_score", "critical_thinking"),
Description = c("Mathematics achievement score (0-100)",
"Reading comprehension score (0-100)",
"Science knowledge score (0-100)",
"Writing proficiency score (0-100)",
"Critical thinking skills score (0-100)"),
`Expected Profiles` = c("High achievers should score ~85",
"Average performers should score ~68",
"Struggling students should score ~48",
"High achievers should score ~86",
"Pattern should differentiate profiles")
)
kable(var_descriptions, caption = "Variables in the Simulated Student Performance Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Variable | Description | Expected.Profiles |
---|---|---|
math_score | Mathematics achievement score (0-100) | High achievers should score ~85 |
reading_score | Reading comprehension score (0-100) | Average performers should score ~68 |
science_score | Science knowledge score (0-100) | Struggling students should score ~48 |
writing_score | Writing proficiency score (0-100) | High achievers should score ~86 |
critical_thinking | Critical thinking skills score (0-100) | Pattern should differentiate profiles |
# Select continuous variables for LPA
lpa_vars <- c("math_score", "reading_score", "science_score", "writing_score", "critical_thinking")
# Note: We keep the true_profile variable for validation but won't use it in LPA
cat("Sample size adequacy check:\n")
Sample size adequacy check:
Total sample: 300
Minimum recommended for 3 profiles: 150-300
Our sample: ADEQUATE ✓
# Display descriptive statistics
desc_stats <- psych::describe(lpa_data[, lpa_vars]) %>%
select(n, mean, sd, min, max, range) %>%
round(2)
kable(desc_stats, caption = "Descriptive Statistics for Academic Performance Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
n | mean | sd | min | max | range | |
---|---|---|---|---|---|---|
math_score | 300 | 67.63 | 18.77 | 12.39 | 100 | 87.61 |
reading_score | 300 | 70.93 | 17.42 | 15.61 | 100 | 84.39 |
science_score | 300 | 65.81 | 19.91 | 10.09 | 100 | 89.91 |
writing_score | 300 | 69.04 | 18.04 | 5.57 | 100 | 94.43 |
critical_thinking | 300 | 63.65 | 19.66 | 4.13 | 100 | 95.87 |
# Show true profile distribution (for validation purposes)
true_profile_dist <- table(lpa_data$true_profile)
cat("\nTrue underlying profiles in simulated data:\n")
True underlying profiles in simulated data:
for(i in 1:length(true_profile_dist)) {
cat("+ ",names(true_profile_dist)[i], ":", true_profile_dist[i],
"(", round(100*true_profile_dist[i]/sum(true_profile_dist), 1), "%)\n")
}
# Examine correlations between variables
cor_matrix <- cor(lpa_data[, lpa_vars])
corrplot(cor_matrix, method = "color", type = "upper",
addCoef.col = "black", tl.cex = 0.8, number.cex = 0.7,
title = "Correlation Matrix of Academic Performance Variables")
Correlation insights:
Academic domains show moderate to high positive correlations,
suggesting students who perform well in one area tend to perform well in others.
This supports the existence of general academic performance profiles.
# LPA Configuration Parameters
config_params <- list(
# Number of profiles to test
n_profiles = 1:6,
# Variance specifications
variances = c("equal", "varying"),
# Covariance specifications
covariances = c("zero", "equal", "varying"),
# Model specifications (tidyLPA uses numbered models)
# Model 1: Equal variances, covariances = 0
# Model 2: Varying variances, covariances = 0
# Model 3: Equal variances, equal covariances
# Model 4: Varying variances, equal covariances
# Model 5: Equal variances, varying covariances
# Model 6: Varying variances, varying covariances
models = 1:6
)
# Display parameter explanations
param_explanations <- data.frame(
Parameter = c("n_profiles", "variances", "covariances", "models"),
Description = c(
"Number of latent profiles to test (typically 1-6)",
"Whether variances are equal or varying across profiles",
"Whether covariances are zero, equal, or varying across profiles",
"Numbered models combining variance/covariance specifications"
),
Recommendation = c(
"Test 1-6 profiles, select based on fit indices and interpretability",
"Start with 'varying' for flexibility, use 'equal' if convergence issues",
"Start with 'zero' for parsimony, add complexity if needed",
"Model 1 (most restrictive) to Model 6 (most flexible)"
)
)
kable(param_explanations, caption = "LPA Configuration Parameters") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(3, width = "40%")
Parameter | Description | Recommendation |
---|---|---|
n_profiles | Number of latent profiles to test (typically 1-6) | Test 1-6 profiles, select based on fit indices and interpretability |
variances | Whether variances are equal or varying across profiles | Start with ‘varying’ for flexibility, use ‘equal’ if convergence issues |
covariances | Whether covariances are zero, equal, or varying across profiles | Start with ‘zero’ for parsimony, add complexity if needed |
models | Numbered models combining variance/covariance specifications | Model 1 (most restrictive) to Model 6 (most flexible) |
model_specs <- data.frame(
Model = 1:6,
Variances = c("Equal", "Varying", "Equal", "Varying", "Equal", "Varying"),
Covariances = c("Zero", "Zero", "Equal", "Equal", "Varying", "Varying"),
Description = c(
"Most restrictive: Same variance, no correlations",
"Varying variances, no correlations",
"Equal variances, same correlations across profiles",
"Varying variances, same correlations across profiles",
"Equal variances, different correlations per profile",
"Most flexible: Different variances and correlations per profile"
),
`When to Use` = c(
"Simple structure, similar profile scatter",
"Different profile scatter, no correlations",
"Profiles differ in correlations, not scatter",
"Profiles differ in both scatter and correlations",
"Complex correlation patterns across profiles",
"Maximum flexibility, large samples needed"
)
)
kable(model_specs, caption = "tidyLPA Model Specifications") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(4, width = "30%") %>%
column_spec(5, width = "30%")
Model | Variances | Covariances | Description | When.to.Use |
---|---|---|---|---|
1 | Equal | Zero | Most restrictive: Same variance, no correlations | Simple structure, similar profile scatter |
2 | Varying | Zero | Varying variances, no correlations | Different profile scatter, no correlations |
3 | Equal | Equal | Equal variances, same correlations across profiles | Profiles differ in correlations, not scatter |
4 | Varying | Equal | Varying variances, same correlations across profiles | Profiles differ in both scatter and correlations |
5 | Equal | Varying | Equal variances, different correlations per profile | Complex correlation patterns across profiles |
6 | Varying | Varying | Most flexible: Different variances and correlations per profile | Maximum flexibility, large samples needed |
# Run LPA with multiple models and profile numbers
# We'll test Models 1, 2, and 3 (most common) with 1-5 profiles
cat("Running LPA models...\n")
Running LPA models…
This may take a few moments…
# Run LPA analysis
lpa_results <- lpa_data %>%
select(all_of(lpa_vars)) %>%
estimate_profiles(
n_profiles = 1:5,
models = 1:3, # Test three most common models
package = "mclust" # Use mclust backend for stability
)
# Extract fit statistics
fit_stats <- get_fit(lpa_results)
# Display fit statistics
fit_stats %>%
gt() %>%
tab_header(title = "Model Fit Statistics") %>%
fmt_number(columns = everything(), decimals = 2) %>%
opt_row_striping() %>%
tab_style(
style = cell_text(size = "small"),
locations = cells_body()
) %>%
tab_options(
table.font.size = "small",
data_row.padding = px(4)
)
Model Fit Statistics | |||||||||||||||||
Model | Classes | LogLik | AIC | AWE | BIC | CAIC | CLC | KIC | SABIC | ICL | Entropy | prob_min | prob_max | n_min | n_max | BLRT_val | BLRT_p |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.00 | 1.00 | −6,521.68 | 13,063.35 | 13,185.43 | 13,100.39 | 13,110.39 | 13,045.35 | 13,076.35 | 13,068.68 | −13,100.39 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | NA | NA |
1.00 | 2.00 | −6,226.13 | 12,484.25 | 12,681.05 | 12,543.52 | 12,559.52 | 12,453.98 | 12,503.25 | 12,492.77 | −12,571.13 | 0.86 | 0.93 | 0.98 | 0.43 | 0.57 | 591.10 | 0.01 |
1.00 | 3.00 | −6,102.43 | 12,248.85 | 12,520.00 | 12,330.33 | 12,352.33 | 12,206.67 | 12,273.85 | 12,260.56 | −12,358.27 | 0.91 | 0.95 | 0.97 | 0.19 | 0.43 | 247.40 | 0.01 |
1.00 | 4.00 | −6,092.28 | 12,240.55 | 12,586.35 | 12,344.26 | 12,372.26 | 12,186.16 | 12,271.55 | 12,255.46 | −12,431.14 | 0.81 | 0.73 | 0.98 | 0.17 | 0.37 | 20.30 | 0.01 |
1.00 | 5.00 | −6,092.25 | 12,252.49 | 12,672.89 | 12,378.42 | 12,412.42 | 12,185.95 | 12,289.49 | 12,270.59 | −12,505.68 | 0.73 | 0.00 | 0.98 | 0.00 | 0.37 | 0.06 | 0.11 |
2.00 | 1.00 | −6,521.68 | 13,063.35 | 13,185.43 | 13,100.39 | 13,110.39 | 13,045.35 | 13,076.35 | 13,068.68 | −13,100.39 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | NA | NA |
2.00 | 2.00 | −6,106.14 | 12,254.27 | 12,512.88 | 12,332.05 | 12,353.05 | 12,214.23 | 12,278.27 | 12,265.45 | −12,334.90 | 0.98 | 1.00 | 1.00 | 0.37 | 0.63 | 831.08 | 0.01 |
2.00 | 3.00 | −6,013.91 | 12,091.82 | 12,487.05 | 12,210.34 | 12,242.34 | 12,029.64 | 12,126.82 | 12,108.86 | −12,235.31 | 0.91 | 0.92 | 0.99 | 0.20 | 0.44 | 184.46 | 0.01 |
2.00 | 4.00 | −6,000.55 | 12,087.11 | 12,619.00 | 12,246.37 | 12,289.37 | 12,002.74 | 12,133.11 | 12,110.00 | −12,333.56 | 0.81 | 0.76 | 1.00 | 0.16 | 0.37 | 26.71 | 0.09 |
2.00 | 5.00 | −5,990.45 | 12,088.91 | 12,757.28 | 12,288.91 | 12,342.91 | 11,982.54 | 12,145.91 | 12,117.66 | −12,387.85 | 0.82 | 0.76 | 0.97 | 0.06 | 0.31 | 20.20 | 0.32 |
3.00 | 1.00 | −6,148.84 | 12,337.68 | 12,583.83 | 12,411.75 | 12,431.75 | 12,299.68 | 12,360.68 | 12,348.32 | −12,411.75 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | NA | NA |
3.00 | 2.00 | −6,134.50 | 12,321.01 | 12,642.23 | 12,417.30 | 12,443.30 | 12,270.38 | 12,350.01 | 12,334.85 | −12,486.18 | 0.69 | 0.86 | 0.93 | 0.38 | 0.62 | 28.67 | 0.01 |
3.00 | 3.00 | −6,113.57 | 12,291.14 | 12,686.67 | 12,409.66 | 12,441.66 | 12,228.66 | 12,326.14 | 12,308.18 | −12,484.85 | 0.76 | 0.79 | 0.92 | 0.11 | 0.57 | 41.86 | 0.01 |
3.00 | 4.00 | −6,103.15 | 12,282.29 | 12,752.21 | 12,423.04 | 12,461.04 | 12,207.86 | 12,323.29 | 12,302.52 | −12,509.04 | 0.78 | 0.74 | 0.95 | 0.13 | 0.41 | 20.85 | 0.02 |
3.00 | 5.00 | −6,103.19 | 12,294.38 | 12,838.91 | 12,457.34 | 12,501.34 | 12,207.78 | 12,341.38 | 12,317.80 | −12,586.31 | 0.70 | 0.00 | 0.94 | 0.00 | 0.41 | −0.08 | 1.00 |
# Model selection guidelines
selection_criteria <- data.frame(
Criterion = c("AIC", "BIC", "Entropy", "BLRT p-value", "Profile Size"),
Description = c(
"Akaike Information Criterion - lower is better",
"Bayesian Information Criterion - lower is better, penalizes complexity more than AIC",
"Classification accuracy - higher is better (>0.80 preferred)",
"Bootstrap Likelihood Ratio Test - significant suggests k > k-1 profiles",
"Minimum profile size - avoid profiles <5% of sample"
),
`Good Value` = c(
"Lowest among compared models",
"Lowest among compared models (often preferred)",
"> 0.80 (excellent), > 0.70 (acceptable)",
"< 0.05 (significant improvement)",
"> 5% of total sample"
)
)
kable(selection_criteria, caption = "Model Selection Criteria") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(2, width = "40%") %>%
column_spec(3, width = "30%")
Criterion | Description | Good.Value |
---|---|---|
AIC | Akaike Information Criterion - lower is better | Lowest among compared models |
BIC | Bayesian Information Criterion - lower is better, penalizes complexity more than AIC | Lowest among compared models (often preferred) |
Entropy | Classification accuracy - higher is better (>0.80 preferred) | > 0.80 (excellent), > 0.70 (acceptable) |
BLRT p-value | Bootstrap Likelihood Ratio Test - significant suggests k > k-1 profiles | < 0.05 (significant improvement) |
Profile Size | Minimum profile size - avoid profiles <5% of sample | > 5% of total sample |
# Find best model based on BIC (most common choice)
best_bic <- fit_stats %>%
filter(BIC == min(BIC, na.rm = TRUE))
cat("\nBest model based on BIC:\n")
Best model based on BIC:
Model: 2 with 3 profiles
BIC = 12210.34
Entropy = 0.908
# Based on fit statistics, let's select the best model
# The algorithm will determine the optimal number of profiles
final_model <- lpa_data %>%
select(all_of(lpa_vars)) %>%
estimate_profiles(n_profiles = 3, models = 1)
# Get profile assignments
profile_assignments <- get_data(final_model) %>%
bind_cols(student_id = lpa_data$student_id, true_profile = lpa_data$true_profile)
# Profile sizes
profile_sizes <- table(profile_assignments$Class)
profile_props <- prop.table(profile_sizes) * 100
cat("Final Model: 3 Profiles, Model 1\n")
Final Model: 3 Profiles, Model 1
Profile sizes:
for(i in 1:3) {
cat("Profile", i, ":", profile_sizes[i], "students (",
round(profile_props[i], 1), "%)\n")
}
Profile 1 : 58 students ( 19.3 %) Profile 2 : 114 students ( 38 %) Profile 3 : 128 students ( 42.7 %)
# Validation: Compare with true profiles
confusion_matrix <- table(
Predicted = profile_assignments$Class,
True = profile_assignments$true_profile
)
cat("\nValidation - Confusion Matrix:\n")
Validation - Confusion Matrix:
True
Predicted Average Performer High Achiever Struggling Student 1 2 0 56 2 9 105 0 3 124 0 4
# Calculate profile means
profile_means <- profile_assignments %>%
group_by(Class) %>%
summarise(
n = n(),
math_mean = mean(math_score),
reading_mean = mean(reading_score),
science_mean = mean(science_score),
writing_mean = mean(writing_score),
critical_thinking_mean = mean(critical_thinking),
.groups = 'drop'
)
kable(profile_means, caption = "Profile Means for Academic Performance", digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Class | n | math_mean | reading_mean | science_mean | writing_mean | critical_thinking_mean |
---|---|---|---|---|---|---|
1 | 58 | 43.9 | 47.7 | 40.0 | 45.8 | 42.5 |
2 | 114 | 84.7 | 86.7 | 81.5 | 85.3 | 80.1 |
3 | 128 | 63.2 | 67.4 | 63.6 | 65.0 | 58.6 |
Profile Interpretations:
# Identify which profile corresponds to which performance level based on means
profiles_ordered <- profile_means %>% arrange(desc(math_mean))
cat("Profile", profiles_ordered$Class[1], " (n=", profiles_ordered$n[1], "):\n")
Profile 2 (n= 114 ):
cat("- Math:", round(profiles_ordered$math_mean[1], 1),
"- Reading:", round(profiles_ordered$reading_mean[1], 1),
"- Science:", round(profiles_ordered$science_mean[1], 1), "\n")
Interpretation: HIGH ACHIEVERS - Consistently high scores across all domains
Profile 3 (n= 128 ):
cat("- Math:", round(profiles_ordered$math_mean[2], 1),
"- Reading:", round(profiles_ordered$reading_mean[2], 1),
"- Science:", round(profiles_ordered$science_mean[2], 1), "\n")
Interpretation: AVERAGE PERFORMERS - Moderate performance across domains
Profile 1 (n= 58 ):
cat("- Math:", round(profiles_ordered$math_mean[3], 1),
"- Reading:", round(profiles_ordered$reading_mean[3], 1),
"- Science:", round(profiles_ordered$science_mean[3], 1), "\n")
Interpretation: STRUGGLING STUDENTS - Lower performance, may need additional support
# Since we have simulated data with known true profiles, we can validate our results
# Calculate accuracy of profile recovery
recovery_stats <- profile_assignments %>%
# Map LPA profiles to true profiles based on highest overlap
group_by(Class, true_profile) %>%
summarise(count = n(), .groups = 'drop') %>%
spread(true_profile, count, fill = 0)
cat("Profile Recovery Analysis:\n")
Profile Recovery Analysis:
kable(recovery_stats, caption = "Cross-tabulation: LPA Profiles vs True Profiles") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Class | Average Performer | High Achiever | Struggling Student |
---|---|---|---|
1 | 2 | 0 | 56 |
2 | 9 | 105 | 0 |
3 | 124 | 0 | 4 |
# Calculate overall accuracy
# This shows how well LPA recovered the underlying structure
max_assignments <- apply(confusion_matrix, 1, max)
total_correct <- sum(max_assignments)
overall_accuracy <- total_correct / sum(confusion_matrix)
cat("\nLPA Performance:\n")
LPA Performance:
Overall classification accuracy: 95 %
cat("This demonstrates that LPA successfully identified the underlying student performance profiles.\n")
This demonstrates that LPA successfully identified the underlying student performance profiles.
# Show sample of students from each profile with their scores
set.seed(456) # For reproducible sampling
sample_students <- profile_assignments %>%
select(student_id, Class, true_profile, math_score, reading_score,
science_score, writing_score, critical_thinking) %>%
group_by(Class) %>%
sample_n(min(5, n())) %>% # Sample up to 5 students per profile
arrange(Class, student_id) %>%
mutate(across(where(is.numeric) & !student_id, round, 1))
kable(sample_students, caption = "Sample of Student Classifications by Profile") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
student_id | Class | true_profile | math_score | reading_score | science_score | writing_score | critical_thinking |
---|---|---|---|---|---|---|---|
242 | 1 | Struggling Student | 33.6 | 58.5 | 10.1 | 52.3 | 64.7 |
258 | 1 | Struggling Student | 22.7 | 61.2 | 18.9 | 63.3 | 64.6 |
259 | 1 | Struggling Student | 27.6 | 46.5 | 38.7 | 48.5 | 53.7 |
265 | 1 | Struggling Student | 39.1 | 28.9 | 17.7 | 49.4 | 47.6 |
269 | 1 | Struggling Student | 51.0 | 20.1 | 28.2 | 20.5 | 43.1 |
18 | 2 | High Achiever | 69.3 | 84.6 | 75.5 | 80.2 | 90.8 |
31 | 2 | High Achiever | 88.4 | 95.9 | 74.9 | 72.0 | 94.3 |
60 | 2 | High Achiever | 86.7 | 85.1 | 82.8 | 90.1 | 77.7 |
63 | 2 | High Achiever | 82.3 | 84.6 | 88.2 | 87.2 | 93.3 |
97 | 2 | High Achiever | 100.0 | 97.2 | 63.9 | 88.1 | 78.7 |
121 | 3 | Average Performer | 61.7 | 46.4 | 29.4 | 67.2 | 86.7 |
145 | 3 | Average Performer | 50.8 | 50.2 | 59.5 | 59.0 | 62.2 |
158 | 3 | Average Performer | 65.9 | 67.0 | 75.6 | 62.3 | 53.6 |
182 | 3 | Average Performer | 64.7 | 74.8 | 63.9 | 69.0 | 55.4 |
230 | 3 | Average Performer | 76.9 | 53.5 | 82.7 | 79.9 | 29.2 |
Note: ‘true_profile’ shows the original simulation category for validation.
In real applications, you wouldn’t have this validation variable.
# Examine posterior probabilities
posterior_probs <- profile_assignments %>%
select(student_id, Class, CPROB1, CPROB2, CPROB3) %>%
rowwise() %>%
mutate(max_prob = max(c(CPROB1, CPROB2, CPROB3))) %>%
arrange(max_prob)
# Show cases with lowest classification certainty
uncertain_cases <- posterior_probs %>%
filter(max_prob < 0.80) %>%
select(student_id, Class, max_prob, CPROB1, CPROB2, CPROB3) %>%
head(10) # Show top 10 most uncertain cases
cat("Students with classification uncertainty (max probability < 0.80):\n")
Students with classification uncertainty (max probability < 0.80):
if(nrow(uncertain_cases) > 0) {
kable(uncertain_cases, digits = 3,
caption = "Students with Lower Classification Certainty") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
cat("\nInterpretation: These students show mixed academic patterns that don't\n")
cat("clearly fit into one profile. They might benefit from individualized assessment.\n")
} else {
cat("All students have high classification certainty (≥0.80)\n")
}
Interpretation: These students show mixed academic patterns that don’t clearly fit into one profile. They might benefit from individualized assessment.
Classification Quality Summary:
Mean maximum probability: 0.96
cat("Students with prob ≥ 0.90:", sum(posterior_probs$max_prob >= 0.90), "/", nrow(posterior_probs),
" (", round(100*sum(posterior_probs$max_prob >= 0.90)/nrow(posterior_probs), 1), "%)\n")
Students with prob ≥ 0.90: 265 / 300 ( 88.3 %)
cat("Students with prob ≥ 0.80:", sum(posterior_probs$max_prob >= 0.80), "/", nrow(posterior_probs),
" (", round(100*sum(posterior_probs$max_prob >= 0.80)/nrow(posterior_probs), 1), "%)\n")
Students with prob ≥ 0.80: 278 / 300 ( 92.7 %)
# Quality interpretation
mean_prob <- mean(posterior_probs$max_prob)
if(mean_prob >= 0.90) {
cat("\nExcellent classification quality - profiles are well-separated.\n")
} else if(mean_prob >= 0.80) {
cat("\nGood classification quality - most students clearly belong to one profile.\n")
} else {
cat("\nModerate classification quality - some profiles may overlap.\n")
}
Excellent classification quality - profiles are well-separated.
# Create profile plot
plot_data <- profile_means %>%
select(-n) %>%
pivot_longer(-Class, names_to = "Variable", values_to = "Mean") %>%
mutate(
Variable = case_when(
Variable == "math_mean" ~ "Mathematics",
Variable == "reading_mean" ~ "Reading",
Variable == "science_mean" ~ "Science",
Variable == "writing_mean" ~ "Writing",
Variable == "critical_thinking_mean" ~ "Critical Thinking"
),
Profile = paste("Profile", Class)
)
# Profile plot with actual scores (0-100 scale)
ggplot(plot_data, aes(x = Variable, y = Mean, group = Profile, color = Profile)) +
geom_line(size = 1.2) +
geom_point(size = 3) +
scale_color_manual(values = c("Profile 1" = "blue",
"Profile 2" = "red",
"Profile 3" = "green")) +
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 20)) +
labs(
title = "Student Academic Performance Profiles",
subtitle = "Mean Scores Across Academic Domains (0-100 scale)",
x = "Academic Domains",
y = "Average Score",
color = "Profile"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 14, face = "bold"),
legend.position = "bottom"
)
# Add horizontal reference lines
last_plot() +
geom_hline(yintercept = c(60, 80), linetype = "dashed", alpha = 0.5) +
annotate("text", x = 5, y = 82, label = "High Performance (80+)", size = 3) +
annotate("text", x = 5, y = 62, label = "Average Performance (60+)", size = 3)
# Scatter plot of key variables colored by profile
library(GGally)
scatter_data <- profile_assignments %>%
select(math_score, reading_score, science_score, writing_score, Class) %>%
mutate(Profile = as.factor(paste("Profile", Class)))
ggpairs(scatter_data,
columns = 1:4,
aes(color = Profile, alpha = 0.7),
title = "Academic Performance Scatter Plot Matrix by Profile") +
theme_minimal()
Interpretation: The scatter plots show how well the profiles separate across different pairs of academic variables. Clear clustering indicates distinct student performance patterns.
# Box plots for each variable by profile
plot_list <- list()
variables <- c("math_score", "reading_score", "science_score", "writing_score", "critical_thinking")
var_labels <- c("Mathematics", "Reading", "Science", "Writing", "Critical Thinking")
for(i in 1:length(variables)) {
plot_list[[i]] <- profile_assignments %>%
mutate(Profile = as.factor(paste("Profile", Class))) %>%
ggplot(aes_string(x = "Profile", y = variables[i], fill = "Profile")) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("Profile 1" = "lightblue",
"Profile 2" = "lightcoral",
"Profile 3" = "lightgreen")) +
labs(title = var_labels[i], y = "Score") +
ylim(0, 100) +
theme_minimal() +
theme(legend.position = "none")
}
# Arrange plots
library(gridExtra)
grid.arrange(grobs = plot_list, ncol = 3,
top = "Distribution of Academic Scores Across Profiles")
Interpretation: Box plots show the distribution and overlap of scores
within each profile. Clear separation indicates distinct performance levels,
while overlap suggests some students are on the boundary between profiles.
Dataset: Simulated Student Academic Performance Data Sample size: 300 students Variables analyzed: Mathematics, Reading, Science, Writing, Critical Thinking
OPTIMAL MODEL:
profiles_ordered <- profile_means %>% arrange(desc(math_mean))
PROFILE DESCRIPTIONS:
Profile 2 - High Achievers (38% of sample):
Profile 3 - Average Performers (42.7% of sample):
Profile 1 - Struggling Students (19.3% of sample):
PRACTICAL IMPLICATIONS:
STRENGTHS OF THIS ANALYSIS:
LIMITATIONS AND CONSIDERATIONS:
RECOMMENDATIONS FOR YOUR OWN LPA:
Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. John Wiley & Sons.
Spurk, D., Hirschi, A., Wang, M., Valero, D., & Kauffeld, S. (2020). Latent profile analysis: A review and “how to” guide of its application within vocational behavior research. Journal of Vocational Behavior, 120, 103445.
Rosenberg, J. M., Beymer, P. N., Anderson, D. J., Van Lissa, C. J., & Schmidt, J. A. (2018). tidyLPA: An R package to easily carry out Latent Profile Analysis (LPA) using open-source or commercial software. Journal of Open Source Software, 3(30), 978.
Additional Resources: - tidyLPA documentation: https://data-edu.github.io/tidyLPA/ - Mplus User’s Guide for mixture modeling - CRAN Task View: Cluster Analysis & Finite Mixture Models