library(readr) # For reading CSV files
library(ggplot2) # For plotting (if needed)
library(caret) # For training the Naive Bayes model
library(naivebayes) # For Naive Bayes classification
library(tibble) # For tidy data output
library(knitr) # For nice tables
library(dplyr) # For data wrangling
library(kableExtra) # For enhanced table formatting
vesas_data <- read_csv("PostBayesIPSummer_250619.csv")
## Rows: 40 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ResponseId
## dbl (9): PartID, PerceptGr, Duration, Sex, FinalRaceEthic, CT_E, CT_V, CTEz,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert grouping variable to factor
vesas_data$PerceptGr <- as.factor(vesas_data$PerceptGr)
This analysis is part of a broader study investigating how students’ interpretations of academic challenges relate to their motivation for pursuing STEM. Specifically, the study examined how students label past science-related experiences—as failures, successes, or neutral events—and how those labels align with their self-reported expectations for success and perceived value of STEM fields.
As part of the data collection, each student responded to two open-ended prompts:
“Tell me about a challenge you faced in a science class. Was this a failure, success, or neither? Please explain why.”
“Tell me about a challenge you faced in a math class. Was this a failure, success, or neither? Please explain why.”
Each student’s responses were coded to capture their general perception of the experience. For every mention of failure, students received a score of -1; for every mention of success, a score of +1; and for neutral responses, a score of 0. These scores were summed across both prompts to create a cumulative label score. Students with positive scores were assigned to the Success group, negative scores to the Failure group, and scores of zero to the Neutral group. Students who did not provide interpretable responses were classified as Unclassified (labeled “Did Not State” in the plots).
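The scoring rule described above can be sketched in a few lines of R. This is only an illustration: the data frame, its column names, and the helper `assign_group()` are hypothetical, each prompt is reduced to a single net code, and the text does not say how partially interpretable responses were handled, so the sketch simply marks a student as Unclassified when either code is missing.

```r
# Hypothetical coded responses: -1 = failure, 0 = neutral, +1 = success.
# NA marks a response that could not be interpreted. All values are invented.
coded <- data.frame(
  student = c("A", "B", "C", "D"),
  science = c(-1, 1, 0, NA),
  math    = c(-1, 0, 0, NA)
)

# Sum the codes across both prompts to get the cumulative label score.
coded$label_score <- coded$science + coded$math

# Map the sign of the score to a perception group (hypothetical helper).
assign_group <- function(score) {
  ifelse(is.na(score), "Unclassified",
         ifelse(score > 0, "Success",
                ifelse(score < 0, "Failure", "Neutral")))
}

coded$group <- assign_group(coded$label_score)
print(coded)
```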
In addition to the open-ended responses, students completed self-report surveys (VESAS) measuring two key aspects of motivation: Expectancy (confidence in succeeding in STEM) and Value (importance placed on STEM). Since the VESAS scores were not used to create the perception groups, they provided an independent measure for validating the groupings. A scatterplot of standardized (z-scored) VESAS scores revealed meaningful differences consistent with motivational theory: students in the Failure group tended to have lower expectancy and value scores compared to those in the Success and Neutral groups.
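For readers unfamiliar with z-scoring, a minimal sketch follows. The raw scores are invented, and it assumes the standardized columns were computed from the sample mean and sample standard deviation, which the text does not state explicitly.

```r
# Hypothetical raw Expectancy scores; the values are invented for illustration.
raw_expectancy <- c(3.2, 4.1, 2.8, 4.6, 3.9)

# z-score: subtract the sample mean and divide by the sample standard deviation.
z_manual <- (raw_expectancy - mean(raw_expectancy)) / sd(raw_expectancy)

# scale() does the same centering and scaling; as.numeric() drops its matrix attributes.
z_scale <- as.numeric(scale(raw_expectancy))

# Both approaches should agree up to floating-point error.
print(all.equal(z_manual, z_scale))
```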
This initial validation method has met with significant resistance, and for good reason, I believe. Relying on visual inspection of standardized VESAS scores (z-scores) does not provide the statistical rigor needed to support the perception groupings with confidence. Addressing this limitation requires a more robust, formal validation approach to justify the grouping method.
# Standardized VESAS scores are 'CTEz' (Expectancy) and 'CTVz' (Value)
ggplot(vesas_data, aes(x = CTEz, y = CTVz, color = PerceptGr)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(
    values = c("0" = "red", "1" = "gray", "2" = "blue", "3" = "black"),
    labels = c("Failure", "Neutral", "Success", "Did Not State"),
    name = "Perception Group"
  ) +
  labs(
    title = "Scatterplot of Standardized Expectancy & Value Scores by Perception Group",
    x = "Standardized Expectancy Score (CTEz)",
    y = "Standardized Value Score (CTVz)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5)
  )
Naive Bayes is a probabilistic classification method based on Bayes’ theorem, which updates the probability of a hypothesis as more evidence becomes available. It assumes conditional independence between predictors given the class label, simplifying calculations while often providing effective classification even when this assumption is violated.
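As a sketch of what this means for the two predictors used here, the posterior probability of perception group k given a student's Expectancy and Value scores can be written as below. The symbols are notation chosen for this illustration only (G for the group, x_E and x_V for the two scores, μ and σ for per-group means and standard deviations), and the Gaussian form of the densities assumes the non-kernel setting (usekernel = FALSE) that appears in the model code.

```latex
P(G = k \mid x_E, x_V)
  = \frac{P(G = k)\, f_{k,E}(x_E)\, f_{k,V}(x_V)}
         {\sum_{j} P(G = j)\, f_{j,E}(x_E)\, f_{j,V}(x_V)},
\qquad
f_{k,p}(x) = \frac{1}{\sqrt{2\pi\sigma_{k,p}^{2}}}
             \exp\!\left(-\frac{(x - \mu_{k,p})^{2}}{2\sigma_{k,p}^{2}}\right)
```

The product of the two densities in the numerator is where the conditional independence assumption enters: within each group, Expectancy and Value are treated as if they were unrelated.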
This method is appropriate for the current study because it handles categorical outcome variables—in this case, perception groups—and can incorporate continuous predictor variables, such as standardized VESAS Expectancy and Value scores. Moreover, Naive Bayes models output posterior probabilities, allowing for assessment of prediction confidence at the individual level. Its relatively simple assumptions and computational efficiency make it well suited for smaller datasets, providing a practical and interpretable means of validating the original qualitative groupings.
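To implement the Naive Bayes classification, the model was trained using 10-fold cross-validation to estimate predictive performance. The VESAS Expectancy and Value scores served as predictors, while perception group membership was the outcome variable. Note that the model formula uses the CT_E and CT_V columns rather than the z-scored CTEz and CTVz. Assuming the z-scored columns are simple linear rescalings of these, the choice makes no difference to a Gaussian Naive Bayes model, because rescaling a predictor changes every group's density by the same factor and so leaves the posterior probabilities unchanged. Model hyperparameters were left at their default values, as the sample size limited extensive tuning.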
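To make the resampling idea concrete, here is a minimal base-R sketch of how 40 observations can be split into 10 folds. It is an illustration of the principle, not a reproduction of the resampling caret performs: the seed and the variable names are arbitrary, and the training-set sizes of 35 to 37 in the printed model summary show that caret's own folds are not perfectly even.

```r
set.seed(1)  # arbitrary seed, only so this sketch is repeatable

n <- 40  # sample size reported for this dataset
k <- 10  # number of folds

# Assign each row to one of k folds at random, keeping fold sizes balanced.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once while the model is fit on the remaining rows.
for (i in seq_len(k)) {
  test_rows  <- which(fold_id == i)
  train_rows <- which(fold_id != i)
  cat(sprintf("fold %2d: train on %d rows, evaluate on %d rows\n",
              i, length(train_rows), length(test_rows)))
}
```

With only 40 students, each held-out fold contains about four cases, so per-fold accuracy is coarse, and the smallest perception groups contribute very few observations to any given training set.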
# Set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Train Naive Bayes model
model <- train(
  PerceptGr ~ CT_E + CT_V,
  data = vesas_data,
  method = "naive_bayes",
  trControl = train_control,
  tuneGrid = expand.grid(
    usekernel = FALSE,
    laplace = 0,
    adjust = 1
  )
)
# Print the model summary
print(model)
## Naive Bayes
##
## 40 samples
## 2 predictor
## 4 classes: '0', '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 36, 37, 36, 36, 35, 36, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6645833 0.2517628
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'usekernel' was held constant at a value of FALSE
## Tuning
## parameter 'adjust' was held constant at a value of 1
# Predict the class labels
vesas_data$PredictedGroup <- predict(model, vesas_data, type = "raw")
# Predict the posterior probabilities
posterior_probs <- predict(model, vesas_data, type = "prob")
For each student, the Naive Bayes model calculated posterior probabilities representing the likelihood of belonging to each perception group based on their VESAS Expectancy and Value scores. The highest posterior probability indicates the group to which the student is most likely to belong according to the model. This predicted group assignment was then compared to the student’s original perception group—based on qualitative sorting—to assess the consistency and validity of the initial classification.
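The calculation behind a single student's posterior probabilities can be sketched by hand. Everything numeric here is invented for illustration (two groups only, with made-up priors, means, and standard deviations), and the sketch assumes Gaussian densities for each predictor; it is not the fitted model from this analysis.

```r
# Hypothetical per-group priors and Gaussian parameters; all numbers are invented.
params <- data.frame(
  group  = c("Failure", "Success"),
  prior  = c(0.3, 0.7),
  mean_E = c(-0.8, 0.4), sd_E = c(0.9, 0.8),
  mean_V = c(-0.5, 0.3), sd_V = c(1.0, 0.7)
)

# One hypothetical student's Expectancy and Value scores.
x_E <- -0.6
x_V <- -0.2

# Prior times the product of per-predictor densities (the "naive" independence step).
unnormalised <- params$prior *
  dnorm(x_E, mean = params$mean_E, sd = params$sd_E) *
  dnorm(x_V, mean = params$mean_V, sd = params$sd_V)

# Normalise so the posteriors sum to 1, then pick the most probable group.
posterior <- unnormalised / sum(unnormalised)
names(posterior) <- params$group
predicted <- names(which.max(posterior))

print(round(posterior, 4))
print(predicted)
```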
# PredictedGroup and posterior_probs were computed in the previous step
# Combine predictions and posterior probabilities into one table
results_tbl <- vesas_data %>%
  mutate(
    Prob_Group0 = posterior_probs[, "0"],
    Prob_Group1 = posterior_probs[, "1"],
    Prob_Group2 = posterior_probs[, "2"],
    Prob_Group3 = posterior_probs[, "3"]
  ) %>%
  select(PartID, PerceptGr, PredictedGroup,
         Prob_Group0, Prob_Group1, Prob_Group2, Prob_Group3)
# Display using kable and a scroll box
results_tbl %>%
  kable(
    "html",
    digits = 4,
    col.names = c("PartID", "Actual Group", "Predicted Group",
                  "Prob 0", "Prob 1", "Prob 2", "Prob 3")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE
  ) %>%
  scroll_box(width = "100%", height = "400px")
| PartID | Actual Group | Predicted Group | Prob 0 | Prob 1 | Prob 2 | Prob 3 |
|---|---|---|---|---|---|---|
| 2 | 2 | 2 | 0.0360 | 0.0515 | 0.9125 | 0.0000 |
| 3 | 3 | 3 | 0.0001 | 0.0001 | 0.0002 | 0.9996 |
| 8 | 2 | 2 | 0.1463 | 0.1755 | 0.6728 | 0.0055 |
| 9 | 0 | 2 | 0.1065 | 0.1808 | 0.7077 | 0.0050 |
| 10 | 2 | 0 | 0.8201 | 0.0016 | 0.1783 | 0.0000 |
| 12 | 2 | 2 | 0.0270 | 0.0195 | 0.9534 | 0.0000 |
| 13 | 2 | 2 | 0.0743 | 0.0056 | 0.9084 | 0.0117 |
| 14 | 2 | 2 | 0.0352 | 0.0187 | 0.9461 | 0.0000 |
| 15 | 2 | 2 | 0.1724 | 0.2835 | 0.5382 | 0.0059 |
| 16 | 2 | 0 | 0.9355 | 0.0003 | 0.0641 | 0.0000 |
| 20 | 2 | 2 | 0.1003 | 0.3477 | 0.5520 | 0.0000 |
| 21 | 2 | 2 | 0.1852 | 0.0539 | 0.7609 | 0.0000 |
| 22 | 0 | 2 | 0.1320 | 0.1436 | 0.7189 | 0.0055 |
| 23 | 2 | 2 | 0.0360 | 0.0515 | 0.9125 | 0.0000 |
| 24 | 0 | 3 | 0.0007 | 0.0003 | 0.0011 | 0.9979 |
| 25 | 3 | 3 | 0.0001 | 0.0001 | 0.0002 | 0.9997 |
| 26 | 1 | 2 | 0.0703 | 0.3292 | 0.6005 | 0.0000 |
| 27 | 2 | 2 | 0.0522 | 0.0088 | 0.9376 | 0.0014 |
| 28 | 2 | 2 | 0.2708 | 0.1677 | 0.5605 | 0.0010 |
| 29 | 2 | 2 | 0.0797 | 0.2806 | 0.6397 | 0.0000 |
| 30 | 2 | 2 | 0.0250 | 0.0107 | 0.9643 | 0.0000 |
| 31 | 0 | 0 | 0.9900 | 0.0000 | 0.0100 | 0.0000 |
| 32 | 2 | 2 | 0.0229 | 0.0077 | 0.9694 | 0.0000 |
| 35 | 2 | 0 | 0.8867 | 0.0007 | 0.1126 | 0.0000 |
| 36 | 0 | 0 | 0.9998 | 0.0000 | 0.0002 | 0.0000 |
| 39 | 0 | 0 | 0.8558 | 0.0011 | 0.1431 | 0.0000 |
| 41 | 2 | 2 | 0.0248 | 0.0140 | 0.9612 | 0.0000 |
| 43 | 1 | 2 | 0.0457 | 0.1664 | 0.7879 | 0.0000 |
| 45 | 2 | 2 | 0.0406 | 0.0574 | 0.9020 | 0.0000 |
| 46 | 2 | 2 | 0.0299 | 0.0390 | 0.9312 | 0.0000 |
| 47 | 0 | 0 | 0.5017 | 0.0175 | 0.4808 | 0.0000 |
| 48 | 2 | 2 | 0.1173 | 0.2197 | 0.6582 | 0.0049 |
| 49 | 0 | 2 | 0.0508 | 0.1825 | 0.7667 | 0.0000 |
| 50 | 2 | 2 | 0.0291 | 0.0087 | 0.9623 | 0.0000 |
| 51 | 0 | 0 | 0.9561 | 0.0000 | 0.0439 | 0.0000 |
| 55 | 0 | 2 | 0.0874 | 0.2125 | 0.6994 | 0.0006 |
| 56 | 1 | 2 | 0.1065 | 0.2934 | 0.5995 | 0.0006 |
| 57 | 2 | 2 | 0.1239 | 0.0386 | 0.8365 | 0.0010 |
| 59 | 2 | 2 | 0.0230 | 0.0094 | 0.9676 | 0.0000 |
| 60 | 1 | 2 | 0.3596 | 0.1757 | 0.4538 | 0.0109 |
# Compare predicted group to actual group
match_logical <- vesas_data$PredictedGroup == vesas_data$PerceptGr
# Calculate percent match
percent_match <- mean(match_logical) * 100
percent_match <- round(percent_match, 2)
# Format to show two decimal places
formatted_match <- formatC(percent_match, format = "f", digits = 2)
# Print the result
paste("The model correctly classified", formatted_match, "% of the cases.")
## [1] "The model correctly classified 70.00 % of the cases."
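The Naive Bayes classifier correctly predicted perception group membership for 28 of 40 students (70%) when applied to the same data on which it was trained. The 10-fold cross-validated estimate from the model summary is somewhat lower (accuracy of about 66%, kappa of about 0.25), which is the better guide to how the model would perform on new students. Taken together, these figures indicate a meaningful, though imperfect, alignment between students’ motivational profiles and their qualitative groupings. Posterior probabilities provided additional insight into classification confidence for individual cases.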
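A confusion matrix gives a fuller picture than a single accuracy figure. The sketch below re-enters the Actual Group and Predicted Group columns from the results table by hand (in table order) so that it runs on its own; the variable names are illustrative, and the kappa computed here is for these in-sample predictions, so it is not the cross-validated kappa from the model summary.

```r
lv <- c("0", "1", "2", "3")  # 0 = Failure, 1 = Neutral, 2 = Success, 3 = Did Not State

# Actual and predicted groups transcribed from the results table, in row order.
actual <- factor(c(2, 3, 2, 0, 2, 2, 2, 2, 2, 2,
                   2, 2, 0, 2, 0, 3, 1, 2, 2, 2,
                   2, 0, 2, 2, 0, 0, 2, 1, 2, 2,
                   0, 2, 0, 2, 0, 0, 1, 2, 2, 1), levels = lv)
predicted <- factor(c(2, 3, 2, 2, 0, 2, 2, 2, 2, 0,
                      2, 2, 2, 2, 3, 3, 2, 2, 2, 2,
                      2, 0, 2, 0, 0, 0, 2, 2, 2, 2,
                      0, 2, 2, 2, 0, 2, 2, 2, 2, 2), levels = lv)

# Rows are actual groups, columns are predicted groups.
conf_mat <- table(Actual = actual, Predicted = predicted)
print(conf_mat)

# Overall accuracy and the accuracy of always predicting the largest group.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
baseline <- max(table(actual)) / length(actual)

# Cohen's kappa: agreement beyond what the row and column totals alone would produce.
expected    <- sum(rowSums(conf_mat) * colSums(conf_mat)) / sum(conf_mat)^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Proportion of each actual group that was classified correctly.
per_group <- diag(conf_mat) / rowSums(conf_mat)

print(round(c(accuracy = accuracy, baseline = baseline, kappa = cohen_kappa), 3))
print(round(per_group, 2))
```

Two points from this breakdown are worth keeping in mind when reading the 70% figure: 24 of the 40 students are in the Success group, so always predicting Success would already be right 60% of the time, and none of the four Neutral students were assigned to the Neutral group by the model.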
Perfect classification is unlikely given the inherent complexity of human motivation and the qualitative nature of the original groupings. Students’ self-reported perceptions of challenges may be influenced by factors beyond what Expectancy and Value scores capture, such as emotional states, contextual variables, individual differences in interpretation, and social desirability bias—where students may portray themselves more positively. Additionally, the Naive Bayes model’s simplifying assumption of predictor independence and the relatively small sample size limit classification precision.
Interestingly, most of the mismatches between predicted and actual perception groups occurred among students whose standardized Expectancy and Value scores fell within one standard deviation of the mean on both scales. Only three mismatched cases were outside this range. In essence, for students with average motivational profiles, the perception of challenges could reasonably fall into more than one group, making classification less certain. This ambiguity reflects the inherently fluid nature of motivation and perception for those near the middle of the motivational spectrum.
Therefore, an accuracy of approximately 70% is plausible and meaningful, indicating that motivational scores capture substantial, though not exhaustive, information about students’ perception group membership.
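The within-one-standard-deviation comparison described above could be checked with a cross-tabulation along the following lines. The data frame here is a toy stand-in with invented values; only the column names CTEz, CTVz, and Match mirror those used in this analysis, and WithinOneSD is a name introduced for the sketch.

```r
# Hypothetical standardized scores and match flags; all values are invented.
toy <- data.frame(
  CTEz  = c(-0.4,  0.2, 1.6, -1.3, 0.7, -0.1),
  CTVz  = c( 0.3, -0.8, 0.5, -1.1, 0.9,  0.0),
  Match = c("Mismatch", "Match", "Match", "Mismatch", "Mismatch", "Match")
)

# TRUE when both scores lie within one standard deviation of the mean.
toy$WithinOneSD <- abs(toy$CTEz) <= 1 & abs(toy$CTVz) <= 1

# Cross-tabulate match status against the within-one-SD flag.
print(table(Match = toy$Match, WithinOneSD = toy$WithinOneSD))
```

Running the same two steps on vesas_data after the Match column has been created would give the counts behind the statement about mismatches near the mean.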
# Create a variable indicating whether prediction matches actual
vesas_data <- vesas_data %>%
  mutate(Match = ifelse(PredictedGroup == PerceptGr, "Match", "Mismatch"))
# Map numeric groups to labels for clarity
group_labels <- c("0" = "Failure", "1" = "Neutral", "2" = "Success", "3" = "Did Not State")
vesas_data$ActualLabel <- factor(vesas_data$PerceptGr, levels = names(group_labels), labels = group_labels)
vesas_data$PredictedLabel <- factor(vesas_data$PredictedGroup, levels = names(group_labels), labels = group_labels)
# Plot with color = Actual group, shape = Match status
ggplot(vesas_data, aes(x = CTEz, y = CTVz, color = ActualLabel, shape = Match)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_manual(
    values = c("Failure" = "red", "Neutral" = "gray", "Success" = "blue", "Did Not State" = "black")
  ) +
  labs(
    title = "VESAS Scores by Sorted Perception Group with Prediction Match Status",
    x = "Standardized Expectancy Score (CTEz)",
    y = "Standardized Value Score (CTVz)",
    color = "Actual Group",
    shape = "Prediction Match"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "right"
  )