1 Introduction to Latent Profile Analysis

Latent Profile Analysis (LPA) is a person-centered statistical approach that identifies unobserved subgroups (profiles) within a population based on patterns of responses across multiple continuous variables. Unlike variable-centered approaches that examine relationships between variables, LPA focuses on identifying types or classes of individuals who share similar response patterns.

1.1 Key Concepts:

Person-centered approach: Groups people rather than variables
Probabilistic classification: Individuals have probabilities of belonging to each profile
Mixture modeling: Assumes the population consists of a mixture of distinct subgroups
Profile interpretation: Each profile is characterized by its pattern of means across indicator variables

1.2 When to Use LPA:

You want to identify distinct subgroups in your population
You have multiple continuous indicator variables
You’re interested in typologies or classification schemes
You want to understand heterogeneity in your sample

1.3 Required Packages

library(tidyLPA)      # Main LPA package
library(tidyverse)    # Data manipulation and visualization
library(knitr)        # For table formatting
library(kableExtra)   # Enhanced table formatting
library(corrplot)     # Correlation plots
library(psych)        # Descriptive statistics
library(gt)

Package Functions: - tidyLPA: Provides user-friendly LPA functions with multiple model parameterizations - tidyverse: Data manipulation, visualization, and analysis - knitr/kableExtra: Professional table formatting - corrplot: Visualization of correlation matrices - psych: Descriptive statistics and psychometric functions

2 Dataset

Simulated Student Academic Performance Data

For this demonstration, we’ll use simulated student performance data with an adequate sample size for LPA. This dataset represents student scores across multiple academic domains.

Note on Sample Size: LPA requires adequate sample sizes - typically 50-100 cases per expected profile. We’ll simulate 300 students to ensure robust results.

# Set seed for reproducibility
set.seed(123)

# Simulate student performance data with three underlying profiles
n_total <- 300

# Profile 1: High Achievers (35% of sample)
n1 <- 105
high_achievers <- data.frame(
  student_id = 1:n1,
  math_score = rnorm(n1, mean = 85, sd = 8),
  reading_score = rnorm(n1, mean = 88, sd = 7),
  science_score = rnorm(n1, mean = 82, sd = 9),
  writing_score = rnorm(n1, mean = 86, sd = 8),
  critical_thinking = rnorm(n1, mean = 80, sd = 10),
  true_profile = "High Achiever"
)

# Profile 2: Average Performers (45% of sample) 
n2 <- 135
average_performers <- data.frame(
  student_id = (n1+1):(n1+n2),
  math_score = rnorm(n2, mean = 65, sd = 12),
  reading_score = rnorm(n2, mean = 68, sd = 11),
  science_score = rnorm(n2, mean = 62, sd = 13),
  writing_score = rnorm(n2, mean = 66, sd = 12),
  critical_thinking = rnorm(n2, mean = 60, sd = 14),
  true_profile = "Average Performer"
)

# Profile 3: Struggling Students (20% of sample)
n3 <- 60
struggling_students <- data.frame(
  student_id = (n1+n2+1):n_total,
  math_score = rnorm(n3, mean = 45, sd = 15),
  reading_score = rnorm(n3, mean = 48, sd = 14),
  science_score = rnorm(n3, mean = 42, sd = 16),
  writing_score = rnorm(n3, mean = 46, sd = 15),
  critical_thinking = rnorm(n3, mean = 40, sd = 18),
  true_profile = "Struggling Student"
)

# Combine all profiles
lpa_data <- rbind(high_achievers, average_performers, struggling_students) %>%
  # Ensure scores are within realistic bounds (0-100)
  mutate(
    math_score = pmax(0, pmin(100, math_score)),
    reading_score = pmax(0, pmin(100, reading_score)),
    science_score = pmax(0, pmin(100, science_score)),
    writing_score = pmax(0, pmin(100, writing_score)),
    critical_thinking = pmax(0, pmin(100, critical_thinking))
  ) %>%
  # Shuffle rows to mix profiles
  sample_n(n_total)

# Basic information about the dataset
cat("Dataset dimensions:", dim(lpa_data), "\n")

Dataset dimensions: 300 7

cat("Number of students:", nrow(lpa_data), "\n")

Number of students: 300

cat("Number of variables:", ncol(lpa_data) - 2, "academic scores\n\n")

Number of variables: 5 academic scores


# Variable descriptions
var_descriptions <- data.frame(
  Variable = c("math_score", "reading_score", "science_score", "writing_score", "critical_thinking"),
  Description = c("Mathematics achievement score (0-100)", 
                  "Reading comprehension score (0-100)", 
                  "Science knowledge score (0-100)",
                  "Writing proficiency score (0-100)", 
                  "Critical thinking skills score (0-100)"),
  `Expected Profiles` = c("High achievers should score ~85", 
                         "Average performers should score ~68",
                         "Struggling students should score ~48",
                         "High achievers should score ~86",
                         "Pattern should differentiate profiles")
)

kable(var_descriptions, caption = "Variables in the Simulated Student Performance Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Variables in the Simulated Student Performance Dataset
Variable	Description	Expected.Profiles
math_score	Mathematics achievement score (0-100)	High achievers should score ~85
reading_score	Reading comprehension score (0-100)	Average performers should score ~68
science_score	Science knowledge score (0-100)	Struggling students should score ~48
writing_score	Writing proficiency score (0-100)	High achievers should score ~86
critical_thinking	Critical thinking skills score (0-100)	Pattern should differentiate profiles

# Select continuous variables for LPA
lpa_vars <- c("math_score", "reading_score", "science_score", "writing_score", "critical_thinking")

# Note: We keep the true_profile variable for validation but won't use it in LPA
cat("Sample size adequacy check:\n")

Sample size adequacy check:

cat("Total sample:", nrow(lpa_data), "\n")

Total sample: 300

cat("Minimum recommended for 3 profiles: 150-300\n")

Minimum recommended for 3 profiles: 150-300

cat("Our sample: ADEQUATE ✓\n\n")

Our sample: ADEQUATE ✓


# Display descriptive statistics
desc_stats <- psych::describe(lpa_data[, lpa_vars]) %>%
  select(n, mean, sd, min, max, range) %>%
  round(2)

kable(desc_stats, caption = "Descriptive Statistics for Academic Performance Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Descriptive Statistics for Academic Performance Variables
	n	mean	sd	min	max	range
math_score	300	67.63	18.77	12.39	100	87.61
reading_score	300	70.93	17.42	15.61	100	84.39
science_score	300	65.81	19.91	10.09	100	89.91
writing_score	300	69.04	18.04	5.57	100	94.43
critical_thinking	300	63.65	19.66	4.13	100	95.87


# Show true profile distribution (for validation purposes)
true_profile_dist <- table(lpa_data$true_profile)
cat("\nTrue underlying profiles in simulated data:\n")

True underlying profiles in simulated data:

for(i in 1:length(true_profile_dist)) {
  cat("+ ",names(true_profile_dist)[i], ":", true_profile_dist[i], 
      "(", round(100*true_profile_dist[i]/sum(true_profile_dist), 1), "%)\n")
}

Average Performer : 135 ( 45 %)
High Achiever : 105 ( 35 %)
Struggling Student : 60 ( 20 %)

# Examine correlations between variables
cor_matrix <- cor(lpa_data[, lpa_vars])
corrplot(cor_matrix, method = "color", type = "upper", 
         addCoef.col = "black", tl.cex = 0.8, number.cex = 0.7,
         title = "Correlation Matrix of Academic Performance Variables")


cat("\nCorrelation insights:\n")

Correlation insights:

cat("Academic domains show moderate to high positive correlations,\n")

Academic domains show moderate to high positive correlations,

cat("suggesting students who perform well in one area tend to perform well in others.\n")

suggesting students who perform well in one area tend to perform well in others.

cat("This supports the existence of general academic performance profiles.\n")

This supports the existence of general academic performance profiles.

3 LPA Configuration Parameters

3.1 Core Parameters

# LPA Configuration Parameters
config_params <- list(
  # Number of profiles to test
  n_profiles = 1:6,
  
  # Variance specifications
  variances = c("equal", "varying"),
  
  # Covariance specifications  
  covariances = c("zero", "equal", "varying"),
  
  # Model specifications (tidyLPA uses numbered models)
  # Model 1: Equal variances, covariances = 0
  # Model 2: Varying variances, covariances = 0  
  # Model 3: Equal variances, equal covariances
  # Model 4: Varying variances, equal covariances
  # Model 5: Equal variances, varying covariances
  # Model 6: Varying variances, varying covariances
  models = 1:6
)

# Display parameter explanations
param_explanations <- data.frame(
  Parameter = c("n_profiles", "variances", "covariances", "models"),
  Description = c(
    "Number of latent profiles to test (typically 1-6)",
    "Whether variances are equal or varying across profiles",
    "Whether covariances are zero, equal, or varying across profiles",
    "Numbered models combining variance/covariance specifications"
  ),
  Recommendation = c(
    "Test 1-6 profiles, select based on fit indices and interpretability",
    "Start with 'varying' for flexibility, use 'equal' if convergence issues",
    "Start with 'zero' for parsimony, add complexity if needed",
    "Model 1 (most restrictive) to Model 6 (most flexible)"
  )
)

kable(param_explanations, caption = "LPA Configuration Parameters") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(3, width = "40%")

LPA Configuration Parameters
Parameter	Description	Recommendation
n_profiles	Number of latent profiles to test (typically 1-6)	Test 1-6 profiles, select based on fit indices and interpretability
variances	Whether variances are equal or varying across profiles	Start with ‘varying’ for flexibility, use ‘equal’ if convergence issues
covariances	Whether covariances are zero, equal, or varying across profiles	Start with ‘zero’ for parsimony, add complexity if needed
models	Numbered models combining variance/covariance specifications	Model 1 (most restrictive) to Model 6 (most flexible)

3.2 Model Specifications in tidyLPA

model_specs <- data.frame(
  Model = 1:6,
  Variances = c("Equal", "Varying", "Equal", "Varying", "Equal", "Varying"),
  Covariances = c("Zero", "Zero", "Equal", "Equal", "Varying", "Varying"),
  Description = c(
    "Most restrictive: Same variance, no correlations",
    "Varying variances, no correlations", 
    "Equal variances, same correlations across profiles",
    "Varying variances, same correlations across profiles",
    "Equal variances, different correlations per profile",
    "Most flexible: Different variances and correlations per profile"
  ),
  `When to Use` = c(
    "Simple structure, similar profile scatter",
    "Different profile scatter, no correlations",
    "Profiles differ in correlations, not scatter",
    "Profiles differ in both scatter and correlations",
    "Complex correlation patterns across profiles",
    "Maximum flexibility, large samples needed"
  )
)

kable(model_specs, caption = "tidyLPA Model Specifications") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(4, width = "30%") %>%
  column_spec(5, width = "30%")

tidyLPA Model Specifications
Model	Variances	Covariances	Description	When.to.Use
1	Equal	Zero	Most restrictive: Same variance, no correlations	Simple structure, similar profile scatter
2	Varying	Zero	Varying variances, no correlations	Different profile scatter, no correlations
3	Equal	Equal	Equal variances, same correlations across profiles	Profiles differ in correlations, not scatter
4	Varying	Equal	Varying variances, same correlations across profiles	Profiles differ in both scatter and correlations
5	Equal	Varying	Equal variances, different correlations per profile	Complex correlation patterns across profiles
6	Varying	Varying	Most flexible: Different variances and correlations per profile	Maximum flexibility, large samples needed

4 Running LPA Analysis

4.1 Model Comparison Approach

# Run LPA with multiple models and profile numbers
# We'll test Models 1, 2, and 3 (most common) with 1-5 profiles

cat("Running LPA models...\n")

Running LPA models…

cat("This may take a few moments...\n\n")

This may take a few moments…


# Run LPA analysis
lpa_results <- lpa_data %>%
  select(all_of(lpa_vars)) %>%
  estimate_profiles(
    n_profiles = 1:5,
    models = 1:3,  # Test three most common models
    package = "mclust"  # Use mclust backend for stability
  )

# Extract fit statistics
fit_stats <- get_fit(lpa_results)

# Display fit statistics
fit_stats %>%
  gt() %>%
  tab_header(title = "Model Fit Statistics") %>%
  fmt_number(columns = everything(), decimals = 2) %>%
  opt_row_striping() %>%
  tab_style(
    style = cell_text(size = "small"),
    locations = cells_body()
  ) %>%
  tab_options(
    table.font.size = "small",
    data_row.padding = px(4)
  )

Model	Classes	LogLik	AIC	AWE	BIC	CAIC	CLC	KIC	SABIC	ICL	Entropy	prob_min	prob_max	n_min	n_max	BLRT_val	BLRT_p
Model Fit Statistics
1.00	1.00	−6,521.68	13,063.35	13,185.43	13,100.39	13,110.39	13,045.35	13,076.35	13,068.68	−13,100.39	1.00	1.00	1.00	1.00	1.00	NA	NA
1.00	2.00	−6,226.13	12,484.25	12,681.05	12,543.52	12,559.52	12,453.98	12,503.25	12,492.77	−12,571.13	0.86	0.93	0.98	0.43	0.57	591.10	0.01
1.00	3.00	−6,102.43	12,248.85	12,520.00	12,330.33	12,352.33	12,206.67	12,273.85	12,260.56	−12,358.27	0.91	0.95	0.97	0.19	0.43	247.40	0.01
1.00	4.00	−6,092.28	12,240.55	12,586.35	12,344.26	12,372.26	12,186.16	12,271.55	12,255.46	−12,431.14	0.81	0.73	0.98	0.17	0.37	20.30	0.01
1.00	5.00	−6,092.25	12,252.49	12,672.89	12,378.42	12,412.42	12,185.95	12,289.49	12,270.59	−12,505.68	0.73	0.00	0.98	0.00	0.37	0.06	0.11
2.00	1.00	−6,521.68	13,063.35	13,185.43	13,100.39	13,110.39	13,045.35	13,076.35	13,068.68	−13,100.39	1.00	1.00	1.00	1.00	1.00	NA	NA
2.00	2.00	−6,106.14	12,254.27	12,512.88	12,332.05	12,353.05	12,214.23	12,278.27	12,265.45	−12,334.90	0.98	1.00	1.00	0.37	0.63	831.08	0.01
2.00	3.00	−6,013.91	12,091.82	12,487.05	12,210.34	12,242.34	12,029.64	12,126.82	12,108.86	−12,235.31	0.91	0.92	0.99	0.20	0.44	184.46	0.01
2.00	4.00	−6,000.55	12,087.11	12,619.00	12,246.37	12,289.37	12,002.74	12,133.11	12,110.00	−12,333.56	0.81	0.76	1.00	0.16	0.37	26.71	0.09
2.00	5.00	−5,990.45	12,088.91	12,757.28	12,288.91	12,342.91	11,982.54	12,145.91	12,117.66	−12,387.85	0.82	0.76	0.97	0.06	0.31	20.20	0.32
3.00	1.00	−6,148.84	12,337.68	12,583.83	12,411.75	12,431.75	12,299.68	12,360.68	12,348.32	−12,411.75	1.00	1.00	1.00	1.00	1.00	NA	NA
3.00	2.00	−6,134.50	12,321.01	12,642.23	12,417.30	12,443.30	12,270.38	12,350.01	12,334.85	−12,486.18	0.69	0.86	0.93	0.38	0.62	28.67	0.01
3.00	3.00	−6,113.57	12,291.14	12,686.67	12,409.66	12,441.66	12,228.66	12,326.14	12,308.18	−12,484.85	0.76	0.79	0.92	0.11	0.57	41.86	0.01
3.00	4.00	−6,103.15	12,282.29	12,752.21	12,423.04	12,461.04	12,207.86	12,323.29	12,302.52	−12,509.04	0.78	0.74	0.95	0.13	0.41	20.85	0.02
3.00	5.00	−6,103.19	12,294.38	12,838.91	12,457.34	12,501.34	12,207.78	12,341.38	12,317.80	−12,586.31	0.70	0.00	0.94	0.00	0.41	−0.08	1.00

4.2 Model Selection Criteria

# Model selection guidelines
selection_criteria <- data.frame(
  Criterion = c("AIC", "BIC", "Entropy", "BLRT p-value", "Profile Size"),
  Description = c(
    "Akaike Information Criterion - lower is better",
    "Bayesian Information Criterion - lower is better, penalizes complexity more than AIC", 
    "Classification accuracy - higher is better (>0.80 preferred)",
    "Bootstrap Likelihood Ratio Test - significant suggests k > k-1 profiles",
    "Minimum profile size - avoid profiles <5% of sample"
  ),
  `Good Value` = c(
    "Lowest among compared models",
    "Lowest among compared models (often preferred)",
    "> 0.80 (excellent), > 0.70 (acceptable)", 
    "< 0.05 (significant improvement)",
    "> 5% of total sample"
  )
)

kable(selection_criteria, caption = "Model Selection Criteria") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(2, width = "40%") %>%
  column_spec(3, width = "30%")

Model Selection Criteria
Criterion	Description	Good.Value
AIC	Akaike Information Criterion - lower is better	Lowest among compared models
BIC	Bayesian Information Criterion - lower is better, penalizes complexity more than AIC	Lowest among compared models (often preferred)
Entropy	Classification accuracy - higher is better (>0.80 preferred)	> 0.80 (excellent), > 0.70 (acceptable)
BLRT p-value	Bootstrap Likelihood Ratio Test - significant suggests k > k-1 profiles	< 0.05 (significant improvement)
Profile Size	Minimum profile size - avoid profiles <5% of sample	> 5% of total sample


# Find best model based on BIC (most common choice)
best_bic <- fit_stats %>% 
  filter(BIC == min(BIC, na.rm = TRUE))

cat("\nBest model based on BIC:\n")

Best model based on BIC:

cat("Model:", best_bic$Model, "with", best_bic$Classes, "profiles\n")

Model: 2 with 3 profiles

cat("BIC =", round(best_bic$BIC, 2), "\n")

BIC = 12210.34

cat("Entropy =", round(best_bic$Entropy, 3), "\n")

Entropy = 0.908

4.3 Final Model Selection and Results

# Based on fit statistics, let's select the best model
# The algorithm will determine the optimal number of profiles
final_model <- lpa_data %>%
  select(all_of(lpa_vars)) %>%
  estimate_profiles(n_profiles = 3, models = 1)

# Get profile assignments
profile_assignments <- get_data(final_model) %>%
  bind_cols(student_id = lpa_data$student_id, true_profile = lpa_data$true_profile)

# Profile sizes
profile_sizes <- table(profile_assignments$Class)
profile_props <- prop.table(profile_sizes) * 100

cat("Final Model: 3 Profiles, Model 1\n")

Final Model: 3 Profiles, Model 1

cat("Profile sizes:\n")

Profile sizes:

for(i in 1:3) {
  cat("Profile", i, ":", profile_sizes[i], "students (", 
      round(profile_props[i], 1), "%)\n")
}

Profile 1 : 58 students ( 19.3 %) Profile 2 : 114 students ( 38 %) Profile 3 : 128 students ( 42.7 %)


# Validation: Compare with true profiles
confusion_matrix <- table(
  Predicted = profile_assignments$Class,
  True = profile_assignments$true_profile
)
cat("\nValidation - Confusion Matrix:\n")

Validation - Confusion Matrix:

print(confusion_matrix)

     True

Predicted Average Performer High Achiever Struggling Student 1 2 0 56 2 9 105 0 3 124 0 4

5 Results Interpretation

5.1 Profile Characteristics

# Calculate profile means
profile_means <- profile_assignments %>%
  group_by(Class) %>%
  summarise(
    n = n(),
    math_mean = mean(math_score),
    reading_mean = mean(reading_score), 
    science_mean = mean(science_score),
    writing_mean = mean(writing_score),
    critical_thinking_mean = mean(critical_thinking),
    .groups = 'drop'
  )

kable(profile_means, caption = "Profile Means for Academic Performance", digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Profile Means for Academic Performance
Class	n	math_mean	reading_mean	science_mean	writing_mean	critical_thinking_mean
1	58	43.9	47.7	40.0	45.8	42.5
2	114	84.7	86.7	81.5	85.3	80.1
3	128	63.2	67.4	63.6	65.0	58.6


# Profile interpretations based on means
cat("\nProfile Interpretations:\n\n")

Profile Interpretations:


# Identify which profile corresponds to which performance level based on means
profiles_ordered <- profile_means %>% arrange(desc(math_mean))

cat("Profile", profiles_ordered$Class[1], " (n=", profiles_ordered$n[1], "):\n")

Profile 2 (n= 114 ):

cat("- Math:", round(profiles_ordered$math_mean[1], 1), 
    "- Reading:", round(profiles_ordered$reading_mean[1], 1),
    "- Science:", round(profiles_ordered$science_mean[1], 1), "\n")

Math: 84.7 - Reading: 86.7 - Science: 81.5

cat("Interpretation: HIGH ACHIEVERS - Consistently high scores across all domains\n\n")

Interpretation: HIGH ACHIEVERS - Consistently high scores across all domains


cat("Profile", profiles_ordered$Class[2], " (n=", profiles_ordered$n[2], "):\n")

Profile 3 (n= 128 ):

cat("- Math:", round(profiles_ordered$math_mean[2], 1),
    "- Reading:", round(profiles_ordered$reading_mean[2], 1), 
    "- Science:", round(profiles_ordered$science_mean[2], 1), "\n")

Math: 63.2 - Reading: 67.4 - Science: 63.6

cat("Interpretation: AVERAGE PERFORMERS - Moderate performance across domains\n\n")

Interpretation: AVERAGE PERFORMERS - Moderate performance across domains


cat("Profile", profiles_ordered$Class[3], " (n=", profiles_ordered$n[3], "):\n")

Profile 1 (n= 58 ):

cat("- Math:", round(profiles_ordered$math_mean[3], 1),
    "- Reading:", round(profiles_ordered$reading_mean[3], 1),
    "- Science:", round(profiles_ordered$science_mean[3], 1), "\n")

Math: 43.9 - Reading: 47.7 - Science: 40

cat("Interpretation: STRUGGLING STUDENTS - Lower performance, may need additional support\n")

Interpretation: STRUGGLING STUDENTS - Lower performance, may need additional support

5.2 Model Validation

# Since we have simulated data with known true profiles, we can validate our results
# Calculate accuracy of profile recovery
recovery_stats <- profile_assignments %>%
  # Map LPA profiles to true profiles based on highest overlap
  group_by(Class, true_profile) %>%
  summarise(count = n(), .groups = 'drop') %>%
  spread(true_profile, count, fill = 0)

cat("Profile Recovery Analysis:\n")

Profile Recovery Analysis:

kable(recovery_stats, caption = "Cross-tabulation: LPA Profiles vs True Profiles") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Cross-tabulation: LPA Profiles vs True Profiles
Class	Average Performer	High Achiever	Struggling Student
1	2	0	56
2	9	105	0
3	124	0	4


# Calculate overall accuracy
# This shows how well LPA recovered the underlying structure
max_assignments <- apply(confusion_matrix, 1, max)
total_correct <- sum(max_assignments)
overall_accuracy <- total_correct / sum(confusion_matrix)

cat("\nLPA Performance:\n")

LPA Performance:

cat("Overall classification accuracy:", round(100 * overall_accuracy, 1), "%\n")

Overall classification accuracy: 95 %

cat("This demonstrates that LPA successfully identified the underlying student performance profiles.\n")

This demonstrates that LPA successfully identified the underlying student performance profiles.

5.3 Sample of Student Classifications

# Show sample of students from each profile with their scores
set.seed(456)  # For reproducible sampling

sample_students <- profile_assignments %>%
  select(student_id, Class, true_profile, math_score, reading_score, 
         science_score, writing_score, critical_thinking) %>%
  group_by(Class) %>%
  sample_n(min(5, n())) %>%  # Sample up to 5 students per profile
  arrange(Class, student_id) %>%
  mutate(across(where(is.numeric) & !student_id, round, 1))

kable(sample_students, caption = "Sample of Student Classifications by Profile") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Sample of Student Classifications by Profile
student_id	Class	true_profile	math_score	reading_score	science_score	writing_score	critical_thinking
242	1	Struggling Student	33.6	58.5	10.1	52.3	64.7
258	1	Struggling Student	22.7	61.2	18.9	63.3	64.6
259	1	Struggling Student	27.6	46.5	38.7	48.5	53.7
265	1	Struggling Student	39.1	28.9	17.7	49.4	47.6
269	1	Struggling Student	51.0	20.1	28.2	20.5	43.1
18	2	High Achiever	69.3	84.6	75.5	80.2	90.8
31	2	High Achiever	88.4	95.9	74.9	72.0	94.3
60	2	High Achiever	86.7	85.1	82.8	90.1	77.7
63	2	High Achiever	82.3	84.6	88.2	87.2	93.3
97	2	High Achiever	100.0	97.2	63.9	88.1	78.7
121	3	Average Performer	61.7	46.4	29.4	67.2	86.7
145	3	Average Performer	50.8	50.2	59.5	59.0	62.2
158	3	Average Performer	65.9	67.0	75.6	62.3	53.6
182	3	Average Performer	64.7	74.8	63.9	69.0	55.4
230	3	Average Performer	76.9	53.5	82.7	79.9	29.2


cat("\nNote: 'true_profile' shows the original simulation category for validation.\n")

Note: ‘true_profile’ shows the original simulation category for validation.

cat("In real applications, you wouldn't have this validation variable.\n")

In real applications, you wouldn’t have this validation variable.

5.4 Classification Uncertainty

# Examine posterior probabilities
posterior_probs <- profile_assignments %>%
  select(student_id, Class, CPROB1, CPROB2, CPROB3) %>%
  rowwise() %>%
  mutate(max_prob = max(c(CPROB1, CPROB2, CPROB3))) %>%
  arrange(max_prob)

# Show cases with lowest classification certainty
uncertain_cases <- posterior_probs %>%
  filter(max_prob < 0.80) %>%
  select(student_id, Class, max_prob, CPROB1, CPROB2, CPROB3) %>%
  head(10)  # Show top 10 most uncertain cases

cat("Students with classification uncertainty (max probability < 0.80):\n")

Students with classification uncertainty (max probability < 0.80):

if(nrow(uncertain_cases) > 0) {
  kable(uncertain_cases, digits = 3, 
        caption = "Students with Lower Classification Certainty") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
  
  cat("\nInterpretation: These students show mixed academic patterns that don't\n")
  cat("clearly fit into one profile. They might benefit from individualized assessment.\n")
} else {
  cat("All students have high classification certainty (≥0.80)\n")
}

Interpretation: These students show mixed academic patterns that don’t clearly fit into one profile. They might benefit from individualized assessment.


# Overall classification quality
cat("\nClassification Quality Summary:\n")

Classification Quality Summary:

cat("Mean maximum probability:", round(mean(posterior_probs$max_prob), 3), "\n")

Mean maximum probability: 0.96

cat("Students with prob ≥ 0.90:", sum(posterior_probs$max_prob >= 0.90), "/", nrow(posterior_probs), 
    " (", round(100*sum(posterior_probs$max_prob >= 0.90)/nrow(posterior_probs), 1), "%)\n")

Students with prob ≥ 0.90: 265 / 300 ( 88.3 %)

cat("Students with prob ≥ 0.80:", sum(posterior_probs$max_prob >= 0.80), "/", nrow(posterior_probs),
    " (", round(100*sum(posterior_probs$max_prob >= 0.80)/nrow(posterior_probs), 1), "%)\n")

Students with prob ≥ 0.80: 278 / 300 ( 92.7 %)


# Quality interpretation
mean_prob <- mean(posterior_probs$max_prob)
if(mean_prob >= 0.90) {
  cat("\nExcellent classification quality - profiles are well-separated.\n")
} else if(mean_prob >= 0.80) {
  cat("\nGood classification quality - most students clearly belong to one profile.\n")
} else {
  cat("\nModerate classification quality - some profiles may overlap.\n")
}

Excellent classification quality - profiles are well-separated.

6 Visualization

6.1 Profile Plot

# Create profile plot
plot_data <- profile_means %>%
  select(-n) %>%
  pivot_longer(-Class, names_to = "Variable", values_to = "Mean") %>%
  mutate(
    Variable = case_when(
      Variable == "math_mean" ~ "Mathematics",
      Variable == "reading_mean" ~ "Reading", 
      Variable == "science_mean" ~ "Science",
      Variable == "writing_mean" ~ "Writing",
      Variable == "critical_thinking_mean" ~ "Critical Thinking"
    ),
    Profile = paste("Profile", Class)
  )

# Profile plot with actual scores (0-100 scale)
ggplot(plot_data, aes(x = Variable, y = Mean, group = Profile, color = Profile)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Profile 1" = "blue", 
                               "Profile 2" = "red", 
                               "Profile 3" = "green")) +
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 20)) +
  labs(
    title = "Student Academic Performance Profiles",
    subtitle = "Mean Scores Across Academic Domains (0-100 scale)",
    x = "Academic Domains",
    y = "Average Score",
    color = "Profile"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "bottom"
  )


# Add horizontal reference lines
last_plot() + 
  geom_hline(yintercept = c(60, 80), linetype = "dashed", alpha = 0.5) +
  annotate("text", x = 5, y = 82, label = "High Performance (80+)", size = 3) +
  annotate("text", x = 5, y = 62, label = "Average Performance (60+)", size = 3)

6.2 Scatter Plot Matrix

# Scatter plot of key variables colored by profile
library(GGally)

scatter_data <- profile_assignments %>%
  select(math_score, reading_score, science_score, writing_score, Class) %>%
  mutate(Profile = as.factor(paste("Profile", Class)))

ggpairs(scatter_data, 
        columns = 1:4,
        aes(color = Profile, alpha = 0.7),
        title = "Academic Performance Scatter Plot Matrix by Profile") +
  theme_minimal()

Interpretation: The scatter plots show how well the profiles separate across different pairs of academic variables. Clear clustering indicates distinct student performance patterns.

6.3 Distribution Comparisons

# Box plots for each variable by profile
plot_list <- list()
variables <- c("math_score", "reading_score", "science_score", "writing_score", "critical_thinking")
var_labels <- c("Mathematics", "Reading", "Science", "Writing", "Critical Thinking")

for(i in 1:length(variables)) {
  plot_list[[i]] <- profile_assignments %>%
    mutate(Profile = as.factor(paste("Profile", Class))) %>%
    ggplot(aes_string(x = "Profile", y = variables[i], fill = "Profile")) +
    geom_boxplot(alpha = 0.7) +
    scale_fill_manual(values = c("Profile 1" = "lightblue", 
                                "Profile 2" = "lightcoral", 
                                "Profile 3" = "lightgreen")) +
    labs(title = var_labels[i], y = "Score") +
    ylim(0, 100) +
    theme_minimal() +
    theme(legend.position = "none")
}

# Arrange plots
library(gridExtra)
grid.arrange(grobs = plot_list, ncol = 3, 
             top = "Distribution of Academic Scores Across Profiles")


cat("\nInterpretation: Box plots show the distribution and overlap of scores\n")

Interpretation: Box plots show the distribution and overlap of scores

cat("within each profile. Clear separation indicates distinct performance levels,\n")

within each profile. Clear separation indicates distinct performance levels,

cat("while overlap suggests some students are on the boundary between profiles.\n")

while overlap suggests some students are on the boundary between profiles.

7 Summary and Interpretation

7.1 Final Results Summary

Dataset: Simulated Student Academic Performance Data Sample size: 300 students Variables analyzed: Mathematics, Reading, Science, Writing, Critical Thinking

OPTIMAL MODEL:

Model type: Model 1 (Equal variances, zero covariances)
Number of profiles: 3
BIC: 1.221034^{4}
Entropy: 0.908 (Classification accuracy)
Overall accuracy vs. true profiles: 95%

profiles_ordered <- profile_means %>% arrange(desc(math_mean))

PROFILE DESCRIPTIONS:

Profile 2 - High Achievers (38% of sample):

Characteristics: High Math (84.7), High Reading (86.7), High Science (81.5)
Examples: Students scoring 80+ across all domains
Interpretation: Consistently high academic performance across all subjects

Profile 3 - Average Performers (42.7% of sample):

Characteristics: Moderate Math (63.2), Moderate Reading (67.4), Moderate Science (63.6)
Examples: Students scoring 60-70 range across domains
Interpretation: Typical academic achievement representing mainstream performance

Profile 1 - Struggling Students (19.3% of sample):

Characteristics: Lower Math (43.9), Lower Reading (47.7), Lower Science (40)
Examples: Students scoring below 50 across most domains
Interpretation: Academic challenges requiring additional support

PRACTICAL IMPLICATIONS:

Schools can use these profiles for differentiated instruction planning
Educators can tailor interventions to each profile’s specific needs
Each profile represents distinct learning support requirements
The analysis provides data-driven evidence for resource allocation decisions

7.2 Methodological Notes

STRENGTHS OF THIS ANALYSIS:

Clear, interpretable profiles with educational relevance
High classification accuracy (entropy > 0.80)
Profiles align with educational theory about student performance
Adequate sample size (n=300) for robust results
Variables showed meaningful differentiation across profiles
Validation against true profiles demonstrates LPA effectiveness)

LIMITATIONS AND CONSIDERATIONS:

Simulated data may not capture all real-world educational complexity
Cross-sectional data - student profiles may change over time with intervention
Academic performance influenced by many factors not included in model
Results may not generalize to different educational contexts or age groups
Profiles are probabilistic, not absolute categories for student classification

RECOMMENDATIONS FOR YOUR OWN LPA:

Test multiple models (1-6 profiles) and compare fit indices
Ensure adequate sample size (>50 cases per expected profile)
Use multiple fit criteria, not just one
Validate profiles with external educational variables when possible
Consider theoretical meaningfulness alongside statistical fit
Report classification uncertainty and profile characteristics
Use visualizations to communicate results effectively to educators

8 References and Further Reading

Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. John Wiley & Sons.
Spurk, D., Hirschi, A., Wang, M., Valero, D., & Kauffeld, S. (2020). Latent profile analysis: A review and “how to” guide of its application within vocational behavior research. Journal of Vocational Behavior, 120, 103445.
Rosenberg, J. M., Beymer, P. N., Anderson, D. J., Van Lissa, C. J., & Schmidt, J. A. (2018). tidyLPA: An R package to easily carry out Latent Profile Analysis (LPA) using open-source or commercial software. Journal of Open Source Software, 3(30), 978.

Additional Resources: - tidyLPA documentation: https://data-edu.github.io/tidyLPA/ - Mplus User’s Guide for mixture modeling - CRAN Task View: Cluster Analysis & Finite Mixture Models

Latent Profile Analysis (LPA)

Proprietary Students Dataset.

Amir Freund

2025-08-25