VESAS Bayes Grouping Validation

R Packages Used

library(readr)         # For reading CSV files
library(ggplot2)       # For plotting
library(caret)         # For training the Naive Bayes model
library(naivebayes)    # For Naive Bayes classification
library(tibble)        # For tidy data output
library(knitr)         # For nice tables
library(dplyr)         # For data wrangling
library(kableExtra)    # For enhanced table formatting

Load and Prepare Data

vesas_data <- read_csv("PostBayesIPSummer_250619.csv")

## Rows: 40 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ResponseId
## dbl (9): PartID, PerceptGr, Duration, Sex, FinalRaceEthic, CT_E, CT_V, CTEz,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Convert grouping variable to factor
vesas_data$PerceptGr <- as.factor(vesas_data$PerceptGr)

Background

This analysis is part of a broader study investigating how students’ interpretations of academic challenges relate to their motivation for pursuing STEM. Specifically, the study examined how students label past science-related experiences—as failures, successes, or neutral events—and how those labels align with their self-reported expectations for success and perceived value of STEM fields.

As part of the data collection, each student responded to two open-ended prompts:

“Tell me about a challenge you faced in a science class. Was this a failure, success, or neither? Please explain why.”
“Tell me about a challenge you faced in a math class. Was this a failure, success, or neither? Please explain why.”

Each student’s responses were coded to capture their general perception of the experience. Every mention of failure received a score of -1, every mention of success a score of +1, and every neutral response a score of 0. These scores were summed across both prompts to create a cumulative label score. Students with positive scores were assigned to the Success group, those with negative scores to the Failure group, and those with scores of zero to the Neutral group. Students who did not provide interpretable responses were classified as Unclassified (labeled “Did Not State” in the plots).
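The scoring rule can be sketched in a few lines of R. The students, their codes, and the helper name `assign_group` are hypothetical, and representing an uninterpretable response as an empty vector is an assumption made for illustration, not the study's actual coding workflow.

```r
# Hypothetical coded mentions for four illustrative students (not study data).
# Each element holds the codes assigned across both prompts:
#   -1 = failure mention, 0 = neutral response, +1 = success mention.
coded_mentions <- list(
  s01 = c(1, 1),        # two success mentions
  s02 = c(-1, 1, -1),   # failures outweigh successes
  s03 = c(1, -1),       # mentions cancel out
  s04 = numeric(0)      # no interpretable response (assumed representation)
)

# Sum the codes and map the sign of the total to a perception group
assign_group <- function(codes) {
  if (length(codes) == 0) return("Unclassified")
  total <- sum(codes)
  if (total > 0) "Success" else if (total < 0) "Failure" else "Neutral"
}

groups <- vapply(coded_mentions, assign_group, character(1))
print(groups)
```

Note that under this rule a student with one failure and one success mention lands in the Neutral group, the same as a student who gave only neutral responses.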

Prior Validation Attempts

In addition to the open-ended responses, students completed self-report surveys (VESAS) measuring two key aspects of motivation: Expectancy (confidence in succeeding in STEM) and Value (importance placed on STEM). Since the VESAS scores were not used to create the perception groups, they provided an independent measure for validating the groupings. A scatterplot of standardized (z-scored) VESAS scores revealed meaningful differences consistent with motivational theory: students in the Failure group tended to have lower expectancy and value scores compared to those in the Success and Neutral groups.

This initial validation method has met with significant resistance, and for good reason, I believe. Visual inspection of standardized VESAS scores (z-scores) does not provide the statistical rigor needed to support the perception groupings with confidence. Justifying the grouping method therefore requires a more robust, formal validation approach.

# Standardized VESAS scores are 'CTEz' (Expectancy) and 'CTVz' (Value)
ggplot(vesas_data, aes(x = CTEz, y = CTVz, color = PerceptGr)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(
    values = c("0" = "red", "1" = "gray", "2" = "blue", "3" = "black"),
    labels = c("Failure", "Neutral", "Success", "Did Not State"),
    name = "Perception Group"
  ) +
  labs(
    title = "Scatterplot of Standardized Expectancy & Value Scores by Perception Group",
    x = "Standardized Expectancy Score (CTEz)",
    y = "Standardized Value Score (CTVz)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5)
  )

Purpose of Naive Bayes

Naive Bayes is a probabilistic classification method based on Bayes’ theorem, which updates the probability of a hypothesis as more evidence becomes available. It assumes conditional independence between predictors given the class label, simplifying calculations while often providing effective classification even when this assumption is violated.

This method is appropriate for the current study because it handles categorical outcome variables—in this case, perception groups—and can incorporate continuous predictor variables, such as standardized VESAS Expectancy and Value scores. Moreover, Naive Bayes models output posterior probabilities, allowing for assessment of prediction confidence at the individual level. Its relatively simple assumptions and computational efficiency make it well suited for smaller datasets, providing a practical and interpretable means of validating the original qualitative groupings.
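In symbols, and using notation introduced here purely for illustration ($G$ for perception group, $x_E$ and $x_V$ for a student's Expectancy and Value scores), the posterior probability that Naive Bayes assigns to group $k$ can be written as:

$$
P(G = k \mid x_E, x_V) =
\frac{P(G = k)\, f(x_E \mid G = k)\, f(x_V \mid G = k)}
     {\sum_{j} P(G = j)\, f(x_E \mid G = j)\, f(x_V \mid G = j)}
$$

Here $P(G = k)$ is the prior probability of group $k$ (by default, its share of the sample) and $f$ is the class-conditional density of each score, which is Gaussian when numeric predictors are modeled without kernel density estimation. The product of the two densities in the numerator is where the conditional independence assumption enters.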

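To implement the Naive Bayes classification, the model was trained with 10-fold cross-validation to estimate predictive performance on held-out cases. VESAS Expectancy and Value scores served as predictors, and perception group membership was the outcome variable. Note that the model formula uses the raw scores (CT_E, CT_V) rather than the standardized ones (CTEz, CTVz); with Gaussian class-conditional densities (usekernel = FALSE), the posterior probabilities are the same either way, provided the z-scores are a linear rescaling of the raw scores. The tuning parameters were held at fixed values (usekernel = FALSE, laplace = 0, adjust = 1) because the sample size limited extensive tuning.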
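Because the prose describes standardized predictors while the model formula uses CT_E and CT_V, it is worth checking that the choice does not matter for a Gaussian Naive Bayes model. The sketch below is a hand-rolled base R implementation run on simulated scores; the function name `gaussian_nb_posterior`, the group labels, and the data are all assumptions for illustration, and it is not the naivebayes package's code.

```r
# Gaussian Naive Bayes posteriors computed by hand (base R only).
gaussian_nb_posterior <- function(X, y) {
  classes <- levels(y)
  # Log prior plus summed log Gaussian densities, one column per class
  log_joint <- sapply(classes, function(k) {
    rows <- y == k
    log_lik <- rowSums(sapply(seq_len(ncol(X)), function(j) {
      dnorm(X[, j], mean = mean(X[rows, j]), sd = sd(X[rows, j]), log = TRUE)
    }))
    log(mean(rows)) + log_lik
  })
  # Normalise each row so the class probabilities sum to 1
  post <- exp(log_joint - apply(log_joint, 1, max))
  post / rowSums(post)
}

set.seed(1)
# Simulated raw scores for two hypothetical groups (not the VESAS data)
y <- factor(rep(c("A", "B"), each = 20))
raw <- cbind(
  E = rnorm(40, mean = ifelse(y == "A", 3.0, 4.0), sd = 0.6),
  V = rnorm(40, mean = ifelse(y == "A", 3.5, 4.2), sd = 0.5)
)
z <- scale(raw)  # z-scores computed over the whole sample

post_raw <- gaussian_nb_posterior(raw, y)
post_z   <- gaussian_nb_posterior(z, y)

# TRUE if the two sets of posteriors agree up to floating-point error
print(isTRUE(all.equal(post_raw, post_z, check.attributes = FALSE)))
```

The equivalence holds because z-scoring multiplies every class-conditional density by the same constant, which cancels when the posteriors are normalised. It would not carry over if the z-scores had been computed from a different sample or by a nonlinear transformation.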

Train Naive Bayes Model

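# caret and naivebayes were loaded in "R Packages Used"

# Set up 10-fold cross-validation.
# No seed is set, so fold assignment (and the resampled accuracy below)
# can differ from run to run; call set.seed() first for reproducibility.
train_control <- trainControl(method = "cv", number = 10)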

# Train Naive Bayes model
model <- train(
  PerceptGr ~ CT_E + CT_V,
  data = vesas_data,
  method = "naive_bayes",
  trControl = train_control,
  tuneGrid = expand.grid(
    usekernel = FALSE,
    laplace = 0,
    adjust = 1
  )
)
# Print the model summary
print(model)

## Naive Bayes 
## 
## 40 samples
##  2 predictor
##  4 classes: '0', '1', '2', '3' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 36, 37, 36, 36, 35, 36, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6645833  0.2517628
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'usekernel' was held constant at a value of FALSE
## Tuning
##  parameter 'adjust' was held constant at a value of 1

Generate Predictions

# Predict the class labels for the same 40 cases used to fit the model
vesas_data$PredictedGroup <- predict(model, vesas_data, type = "raw")

# Predict the posterior probabilities (one column per group: "0" to "3")
posterior_probs <- predict(model, vesas_data, type = "prob")

Final Results Table

For each student, the Naive Bayes model calculated posterior probabilities representing the likelihood of belonging to each perception group based on their VESAS Expectancy and Value scores. The highest posterior probability indicates the group to which the student is most likely to belong according to the model. This predicted group assignment was then compared to the student’s original perception group—based on qualitative sorting—to assess the consistency and validity of the initial classification.

# PredictedGroup and posterior_probs come from "Generate Predictions" above

# Step 1: Combine actual group, predicted group, and posteriors into one table
results_tbl <- vesas_data %>%
  mutate(
    Prob_Group0 = posterior_probs[, "0"],
    Prob_Group1 = posterior_probs[, "1"],
    Prob_Group2 = posterior_probs[, "2"],
    Prob_Group3 = posterior_probs[, "3"]
  ) %>%
  select(
    PartID, PerceptGr, PredictedGroup,
    Prob_Group0, Prob_Group1, Prob_Group2, Prob_Group3
  )

# Step 2: Display with kable inside a scroll box
results_tbl %>%
  kable(
    "html",
    digits = 4,
    col.names = c("PartID", "Actual Group", "Predicted Group",
                  "Prob 0", "Prob 1", "Prob 2", "Prob 3")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE
  ) %>%
  scroll_box(width = "100%", height = "400px")

PartID	Actual Group	Predicted Group	Prob 0	Prob 1	Prob 2	Prob 3
2	2	2	0.0360	0.0515	0.9125	0.0000
3	3	3	0.0001	0.0001	0.0002	0.9996
8	2	2	0.1463	0.1755	0.6728	0.0055
9	0	2	0.1065	0.1808	0.7077	0.0050
10	2	0	0.8201	0.0016	0.1783	0.0000
12	2	2	0.0270	0.0195	0.9534	0.0000
13	2	2	0.0743	0.0056	0.9084	0.0117
14	2	2	0.0352	0.0187	0.9461	0.0000
15	2	2	0.1724	0.2835	0.5382	0.0059
16	2	0	0.9355	0.0003	0.0641	0.0000
20	2	2	0.1003	0.3477	0.5520	0.0000
21	2	2	0.1852	0.0539	0.7609	0.0000
22	0	2	0.1320	0.1436	0.7189	0.0055
23	2	2	0.0360	0.0515	0.9125	0.0000
24	0	3	0.0007	0.0003	0.0011	0.9979
25	3	3	0.0001	0.0001	0.0002	0.9997
26	1	2	0.0703	0.3292	0.6005	0.0000
27	2	2	0.0522	0.0088	0.9376	0.0014
28	2	2	0.2708	0.1677	0.5605	0.0010
29	2	2	0.0797	0.2806	0.6397	0.0000
30	2	2	0.0250	0.0107	0.9643	0.0000
31	0	0	0.9900	0.0000	0.0100	0.0000
32	2	2	0.0229	0.0077	0.9694	0.0000
35	2	0	0.8867	0.0007	0.1126	0.0000
36	0	0	0.9998	0.0000	0.0002	0.0000
39	0	0	0.8558	0.0011	0.1431	0.0000
41	2	2	0.0248	0.0140	0.9612	0.0000
43	1	2	0.0457	0.1664	0.7879	0.0000
45	2	2	0.0406	0.0574	0.9020	0.0000
46	2	2	0.0299	0.0390	0.9312	0.0000
47	0	0	0.5017	0.0175	0.4808	0.0000
48	2	2	0.1173	0.2197	0.6582	0.0049
49	0	2	0.0508	0.1825	0.7667	0.0000
50	2	2	0.0291	0.0087	0.9623	0.0000
51	0	0	0.9561	0.0000	0.0439	0.0000
55	0	2	0.0874	0.2125	0.6994	0.0006
56	1	2	0.1065	0.2934	0.5995	0.0006
57	2	2	0.1239	0.0386	0.8365	0.0010
59	2	2	0.0230	0.0094	0.9676	0.0000
60	1	2	0.3596	0.1757	0.4538	0.0109
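The link between the probability columns and the predicted group can be made explicit with three rows copied from the table above. The sketch picks the column with the largest posterior and also computes a margin (largest minus second-largest posterior); the margin is an illustrative confidence measure introduced here, not part of the original analysis.

```r
# Posterior probabilities for three students, copied from the results table
post <- rbind(
  "12" = c(0.0270, 0.0195, 0.9534, 0.0000),
  "47" = c(0.5017, 0.0175, 0.4808, 0.0000),
  "60" = c(0.3596, 0.1757, 0.4538, 0.0109)
)
colnames(post) <- c("0", "1", "2", "3")

# Predicted group = column with the largest posterior probability
predicted <- colnames(post)[apply(post, 1, which.max)]

# Margin = gap between the largest and second-largest posterior
margin <- apply(post, 1, function(p) {
  s <- sort(p, decreasing = TRUE)
  s[1] - s[2]
})

print(data.frame(PartID = rownames(post), predicted, margin = round(margin, 4)))
```

Student 12 is assigned to group 2 by a wide margin, whereas student 47 is assigned to group 0 with posteriors of 0.5017 versus 0.4808, so that assignment could easily flip with a small change in the scores.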

Model Accuracy

# Compare predicted group to actual group
match_logical <- vesas_data$PredictedGroup == vesas_data$PerceptGr

# Calculate percent match
percent_match <- mean(match_logical) * 100

# Print the result with two decimal places
sprintf("The model correctly classified %.2f%% of the cases.", percent_match)

## [1] "The model correctly classified 70.00% of the cases."

Results Summary

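The Naive Bayes classifier correctly predicted students’ perception group membership with an overall accuracy of 70% (28 of 40 students) when predictions were made for the same cases used to fit the model. The 10-fold cross-validated estimate reported in the training output is slightly lower, at about 66% (Kappa ≈ 0.25). Both figures should be read against the 60% that would result from assigning every student to the Success group, which contains 24 of the 40 students. Taken together, the results indicate alignment between students’ motivational profiles and their qualitative groupings. Posterior probabilities provided additional insight into classification confidence for individual cases.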
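A confusion matrix shows how the overall accuracy is distributed across groups. The sketch below transcribes the Actual Group and Predicted Group columns from the results table (in PartID order) and computes accuracy, a majority-class baseline, Cohen's kappa, and per-group recall in base R. These are in-sample figures, so the kappa here is not comparable to the cross-validated value printed by caret; the variable names are chosen for illustration.

```r
# Actual and predicted groups, transcribed in PartID order from the results table
actual <- c(2, 3, 2, 0, 2,  2, 2, 2, 2, 2,  2, 2, 0, 2, 0,  3, 1, 2, 2, 2,
            2, 0, 2, 2, 0,  0, 2, 1, 2, 2,  0, 2, 0, 2, 0,  0, 1, 2, 2, 1)
predicted <- c(2, 3, 2, 2, 0,  2, 2, 2, 2, 0,  2, 2, 2, 2, 3,  3, 2, 2, 2, 2,
               2, 0, 2, 0, 0,  0, 2, 2, 2, 2,  0, 2, 2, 2, 0,  2, 2, 2, 2, 2)

labels <- c("Failure", "Neutral", "Success", "Did Not State")
actual_f    <- factor(actual, levels = 0:3, labels = labels)
predicted_f <- factor(predicted, levels = 0:3, labels = labels)

# Rows = actual group, columns = predicted group
conf_mat <- table(Actual = actual_f, Predicted = predicted_f)
print(conf_mat)

n <- sum(conf_mat)
accuracy <- sum(diag(conf_mat)) / n
# Accuracy obtained by always predicting the largest actual group
baseline <- max(rowSums(conf_mat)) / n
# Cohen's kappa: agreement beyond what the marginal totals give by chance
expected <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n^2
kappa <- (accuracy - expected) / (1 - expected)
# Share of each actual group that was predicted correctly
recall <- diag(conf_mat) / rowSums(conf_mat)

print(round(c(accuracy = accuracy, baseline = baseline, kappa = kappa), 3))
print(round(recall, 2))
```

With the table's values this reproduces the 28 of 40 figure. It also shows that none of the four Neutral students were predicted as Neutral, which the overall accuracy alone does not reveal.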

Perfect classification is unlikely given the inherent complexity of human motivation and the qualitative nature of the original groupings. Students’ self-reported perceptions of challenges may be influenced by factors beyond what Expectancy and Value scores capture, such as emotional states, contextual variables, individual differences in interpretation, and social desirability bias—where students may portray themselves more positively. Additionally, the Naive Bayes model’s simplifying assumption of predictor independence and the relatively small sample size limit classification precision.

Interestingly, most of the mismatches between predicted and actual perception groups occurred among students whose standardized Expectancy and Value scores fell within one standard deviation of the mean on both scales. Only three mismatched cases were outside this range. In essence, for students with average motivational profiles, the perception of challenges could reasonably fall into more than one group, making classification less certain. This ambiguity reflects the inherently fluid nature of motivation and perception for those near the middle of the motivational spectrum.
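One way to tabulate this pattern is to flag students whose standardized scores both lie within one standard deviation of the mean and cross-tabulate that flag against match status. The data frame below is hypothetical (six invented students, not study data), and treating a score of exactly ±1 as within range is an assumption.

```r
# Hypothetical z-scores and match status for six students (not study data)
example <- data.frame(
  CTEz  = c(-0.4,  0.2, 1.6, -1.8, 0.9, -0.1),
  CTVz  = c( 0.3, -0.7, 0.5, -0.2, 1.3,  0.8),
  Match = c("Mismatch", "Match", "Match", "Mismatch", "Match", "Mismatch")
)

# TRUE when both standardized scores lie within one SD of the mean
example$WithinOneSD <- abs(example$CTEz) <= 1 & abs(example$CTVz) <= 1

# Cross-tabulate position in the score distribution against match status
print(table(WithinOneSD = example$WithinOneSD, Match = example$Match))
```

The same two lines could be applied to vesas_data once its Match column has been created, since CTEz and CTVz are already columns in that data frame.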

Therefore, an accuracy of approximately 70% is plausible and meaningful, indicating that motivational scores capture substantial, though not exhaustive, information about students’ perception group membership.

Summary of Matches and Mismatches

# Create a variable indicating whether prediction matches actual
vesas_data <- vesas_data %>% 
  mutate(Match = ifelse(PredictedGroup == PerceptGr, "Match", "Mismatch"))

# Map numeric groups to labels for clarity
group_labels <- c("0" = "Failure", "1" = "Neutral", "2" = "Success", "3" = "Did Not State")
vesas_data$ActualLabel <- factor(vesas_data$PerceptGr, levels = names(group_labels), labels = group_labels)
vesas_data$PredictedLabel <- factor(vesas_data$PredictedGroup, levels = names(group_labels), labels = group_labels)

# Plot with color = Actual group, shape = Match status
ggplot(vesas_data, aes(x = CTEz, y = CTVz, color = ActualLabel, shape = Match)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_manual(
    values = c("Failure" = "red", "Neutral" = "gray", "Success" = "blue", "Did Not State" = "black")
  ) +
  labs(
    title = "VESAS Scores by Sorted Perception Group with Prediction Match Status",
    x = "Standardized Expectancy Score (CTEz)",
    y = "Standardized Value Score (CTVz)",
    color = "Actual Group",
    shape = "Prediction Match"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "right"
  )