Title: How Does ChatGPT Perform on the Gynecologic Oncology and Critical Care PROLOG Exam?
Authors: Eric Helm, Alex Spyridon Mastroyannis, Tyler Muffly
Objectives: Artificial intelligence (AI) is rapidly being integrated into standard medical technologies, with proposed applications as adjunct tools in clinical decision-making. This study evaluates ChatGPT's ability to correctly answer board-level obstetrics and gynecology (OBGYN) questions in gynecologic oncology and critical care.
Methods: The American College of Obstetricians and Gynecologists (ACOG) produces PROLOG practice questions to reflect board-level assessments. We evaluated ChatGPT on the eighth edition of the PROLOG Gynecologic Oncology and Critical Care question set. Each question was categorized by Bloom's taxonomy into three difficulty levels: Easy (Knowledge, Understand), Moderate (Apply), and Advanced (Analysis, Evaluate); the Create level was excluded because of the multiple-choice format. Questions were additionally stratified by clinical content area: counseling, epidemiology and biostatistics, medical management, screening and diagnosis, and surgical management. Image-based questions were excluded. Statistical comparisons were performed using chi-square analysis.
Results: Of the 148 PROLOG questions, 136 met inclusion criteria. ChatGPT achieved an overall accuracy of 78.7% (107/136). By difficulty level, ChatGPT scored 88.0% (44/50) on easy questions, 79.6% (39/49) on moderate questions, and 64.9% (24/37) on advanced questions (p = 0.033). By content area, ChatGPT achieved 62.5% (5/8) on counseling, 100% (6/6) on epidemiology and biostatistics, 82.5% (52/63) on medical management, 73.9% (17/23) on screening and diagnosis, and 75.0% (27/36) on surgical management questions (p = 0.40). Error rates were highest for counseling questions (37.5%) and analysis-level questions (35.1%).
Conclusion: ChatGPT's overall accuracy on these board-level gynecologic oncology and critical care questions was 78.7%, just below the 80.0% passing threshold. The model performed significantly worse on higher-order cognitive tasks, particularly analysis-level questions, and showed numerically lower accuracy in the counseling and surgical management domains, both of which demand clinical judgment beyond textbook knowledge. These findings suggest that while AI models like ChatGPT may serve as useful study aids, their limitations in higher-order decision-making must be addressed before integration into clinical workflows. Further research is needed to refine AI applications in medical education and patient care.
The remainder of this R Markdown document contains the analysis code and its output. The first chunk runs the full diagnostic pipeline over the scored question set:
output <- analyze_ai_diagnostics(
  data_path = "/Users/tylermuffly/Dropbox (Personal)/prolog/Data/Prolog Datasheet.xlsx",
  file_type = "excel",
  sheet_name = "Data",
  ai_answer_col = "ChatGPT Answer",
  expert_answer_col = "Actual Answer",
  category_col = "Category",
  taxonomy_col = "Taxonomy",
  correct_col = "Correct",
  verbose = TRUE
)
output
## $overall_metrics
## metric value
## 1 Total Questions 136.00
## 2 Correct Answers 107.00
## 3 Incorrect Answers 29.00
## 4 Accuracy (%) 78.68
## 5 Concordance Count 107.00
## 6 Concordance (%) 78.68
##
## $category_metrics
## # A tibble: 5 × 7
## Category total_questions correct_count incorrect_count accuracy
## <chr> <int> <int> <int> <dbl>
## 1 Epidemiology and Biost… 6 6 0 100
## 2 Medical Management 63 52 11 82.5
## 3 Surgical Management 36 27 9 75
## 4 Screening and Diagnosis 23 17 6 73.9
## 5 Counseling 8 5 3 62.5
## # ℹ 2 more variables: concordance_count <int>, concordance_percentage <dbl>
##
## $taxonomy_metrics
## # A tibble: 4 × 7
## Taxonomy total_questions correct_count incorrect_count accuracy
## <chr> <int> <int> <int> <dbl>
## 1 Understand 33 30 3 90.9
## 2 Knowledge 17 14 3 82.4
## 3 Apply 49 39 10 79.6
## 4 Analysis 37 24 13 64.9
## # ℹ 2 more variables: concordance_count <int>, concordance_percentage <dbl>
##
## $error_analysis
## $error_analysis$error_by_category
## # A tibble: 4 × 4
## Category error_count total_count error_percentage
## <chr> <int> <int> <dbl>
## 1 Counseling 3 8 37.5
## 2 Screening and Diagnosis 6 23 26.1
## 3 Surgical Management 9 36 25
## 4 Medical Management 11 63 17.5
##
## $error_analysis$error_by_taxonomy
## # A tibble: 4 × 4
## Taxonomy error_count total_count error_percentage
## <chr> <int> <int> <dbl>
## 1 Analysis 13 37 35.1
## 2 Apply 10 49 20.4
## 3 Knowledge 3 17 17.6
## 4 Understand 3 33 9.09
##
##
## $summary
## finding value
## 1 Overall Accuracy (%) 78.68
## 2 Overall Concordance (%) 78.68
## 3 Best Performing Category Epidemiology and Biostats
## 4 Best Category Accuracy (%) 100
## 5 Challenging Category Counseling
## 6 Challenging Category Accuracy (%) 62.5
## 7 Best Performing Cognitive Level Understand
## 8 Best Cognitive Level Accuracy (%) 90.91
## 9 Challenging Cognitive Level Analysis
## 10 Challenging Cognitive Level Accuracy (%) 64.86
## 11 Category with Highest Error Rate Counseling
## 12 Highest Category Error Rate (%) 37.5
## 13 Taxonomy with Highest Error Rate Analysis
## 14 Highest Taxonomy Error Rate (%) 35.14
##
## $visualizations
## $visualizations$category_accuracy_plot
##
## $visualizations$taxonomy_accuracy_plot
##
## $visualizations$category_error_plot
##
## $visualizations$taxonomy_error_plot
##
##
## $data
## # A tibble: 136 × 8
## Question Category Taxonomy `ChatGPT Answer` `Actual Answer` Correct
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 10 Counseling Underst… B C No
## 2 22 Counseling Underst… C C Yes
## 3 32 Counseling Apply C C Yes
## 4 45 Counseling Apply D E No
## 5 76 Counseling Underst… B B Yes
## 6 107 Counseling Knowled… E E Yes
## 7 110 Counseling Knowled… B C No
## 8 117 Counseling Apply C C Yes
## 9 28 Epidemiology and … Analysis D D Yes
## 10 43 Epidemiology and … Apply A A Yes
## # ℹ 126 more rows
## # ℹ 2 more variables: is_correct <lgl>, concordance <lgl>
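For item-level review, the per-question tibble in output$data can be filtered down to the 29 missed questions. A minimal sketch in base R, so it does not depend on packages loaded later in the document:

# Pull out the 29 questions ChatGPT answered incorrectly for item-level review
missed <- subset(output$data, Correct == "No")
missed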
knitr::kable(output$overall_metrics)
| metric            |  value |
|:------------------|-------:|
| Total Questions   | 136.00 |
| Correct Answers   | 107.00 |
| Incorrect Answers |  29.00 |
| Accuracy (%)      |  78.68 |
| Concordance Count | 107.00 |
| Concordance (%)   |  78.68 |
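The same kable treatment works for the remaining tables when a formatted report is wanted; for example:

# Render the category metrics as a formatted table, rounded to one decimal
knitr::kable(output$category_metrics, digits = 1)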
output$category_metrics
## # A tibble: 5 × 7
## Category total_questions correct_count incorrect_count accuracy
## <chr> <int> <int> <int> <dbl>
## 1 Epidemiology and Biost… 6 6 0 100
## 2 Medical Management 63 52 11 82.5
## 3 Surgical Management 36 27 9 75
## 4 Screening and Diagnosis 23 17 6 73.9
## 5 Counseling 8 5 3 62.5
## # ℹ 2 more variables: concordance_count <int>, concordance_percentage <dbl>
output$taxonomy_metrics
## # A tibble: 4 × 7
## Taxonomy total_questions correct_count incorrect_count accuracy
## <chr> <int> <int> <int> <dbl>
## 1 Understand 33 30 3 90.9
## 2 Knowledge 17 14 3 82.4
## 3 Apply 49 39 10 79.6
## 4 Analysis 37 24 13 64.9
## # ℹ 2 more variables: concordance_count <int>, concordance_percentage <dbl>
output$summary
## finding value
## 1 Overall Accuracy (%) 78.68
## 2 Overall Concordance (%) 78.68
## 3 Best Performing Category Epidemiology and Biostats
## 4 Best Category Accuracy (%) 100
## 5 Challenging Category Counseling
## 6 Challenging Category Accuracy (%) 62.5
## 7 Best Performing Cognitive Level Understand
## 8 Best Cognitive Level Accuracy (%) 90.91
## 9 Challenging Cognitive Level Analysis
## 10 Challenging Cognitive Level Accuracy (%) 64.86
## 11 Category with Highest Error Rate Counseling
## 12 Highest Category Error Rate (%) 37.5
## 13 Taxonomy with Highest Error Rate Analysis
## 14 Highest Taxonomy Error Rate (%) 35.14
This output consolidates ChatGPT's key performance metrics across content categories and cognitive levels.
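Individual findings can also be pulled out programmatically; a small sketch, assuming overall_metrics keeps the metric/value columns shown above:

# Extract the overall accuracy figure from the metrics table
acc <- output$overall_metrics$value[output$overall_metrics$metric == "Accuracy (%)"]
acc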
output$visualizations$category_accuracy_plot
output$visualizations$taxonomy_accuracy_plot
output$visualizations$category_error_plot
output$visualizations$taxonomy_error_plot
### Statistical Analysis
library(tidyverse)
library(broom)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(beepr)
library(logger)
library(stats)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Load data
data <- readxl::read_excel("Data/Prolog Datasheet.xlsx", sheet = "Data") %>%
dplyr::filter(Taxonomy != "Need picture to answer")
# Convert Correct column to binary
log_info("Converting 'Correct' column to binary format.")
data <- data %>%
mutate(Correct_binary = ifelse(Correct == "Yes", 1, 0))
# Overall accuracy
log_info("Calculating overall accuracy.")
overall_accuracy <- data %>%
count(Correct) %>%
mutate(Percentage = n / sum(n) * 100)
print(overall_accuracy)
## # A tibble: 2 × 3
## Correct n Percentage
## <chr> <int> <dbl>
## 1 No 29 21.3
## 2 Yes 107 78.7
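An exact 95% confidence interval can be attached to the overall accuracy estimate; a quick sketch using base R's binom.test:

# Clopper-Pearson 95% CI for 107 correct answers out of 136 questions
binom.test(x = 107, n = 136)$conf.int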
# Accuracy by Category
log_info("Calculating accuracy by Category.")
accuracy_by_category <- data %>%
group_by(Category) %>%
count(Correct) %>%
mutate(Percentage = n / sum(n) * 100)
print(accuracy_by_category)
## # A tibble: 9 × 4
## # Groups: Category [5]
## Category Correct n Percentage
## <chr> <chr> <int> <dbl>
## 1 Counseling No 3 37.5
## 2 Counseling Yes 5 62.5
## 3 Epidemiology and Biostats Yes 6 100
## 4 Medical Management No 11 17.5
## 5 Medical Management Yes 52 82.5
## 6 Screening and Diagnosis No 6 26.1
## 7 Screening and Diagnosis Yes 17 73.9
## 8 Surgical Management No 9 25
## 9 Surgical Management Yes 27 75
# Accuracy by Taxonomy
log_info("Calculating accuracy by Taxonomy.")
accuracy_by_taxonomy <- data %>%
group_by(Taxonomy) %>%
count(Correct) %>%
mutate(Percentage = n / sum(n) * 100)
print(accuracy_by_taxonomy)
## # A tibble: 8 × 4
## # Groups: Taxonomy [4]
## Taxonomy Correct n Percentage
## <chr> <chr> <int> <dbl>
## 1 Analysis No 13 35.1
## 2 Analysis Yes 24 64.9
## 3 Apply No 10 20.4
## 4 Apply Yes 39 79.6
## 5 Knowledge No 3 17.6
## 6 Knowledge Yes 14 82.4
## 7 Understand No 3 9.09
## 8 Understand Yes 30 90.9
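For a more compact view, these long-format accuracy tables can be reshaped to one row per level; a sketch using tidyr, which the tidyverse load above provides:

# Spread correct/incorrect counts and percentages into side-by-side columns
accuracy_by_taxonomy %>%
  tidyr::pivot_wider(names_from = Correct, values_from = c(n, Percentage))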
# Chi-square test for correctness across categories
table_category <- table(data$Category, data$Correct)
chi_category <- chisq.test(table_category)
## Warning in chisq.test(table_category): Chi-squared approximation may be
## incorrect
log_info("Chi-square test for Category vs Correctness: p-value = {}", chi_category$p.value)
print(chi_category)
##
## Pearson's Chi-squared test
##
## data: table_category
## X-squared = 4.0356, df = 4, p-value = 0.4012
# Chi-square test for correctness across taxonomy levels
table_taxonomy <- table(data$Taxonomy, data$Correct)
chi_taxonomy <- chisq.test(table_taxonomy)
## Warning in chisq.test(table_taxonomy): Chi-squared approximation may be
## incorrect
log_info("Chi-square test for Taxonomy vs Correctness: p-value = {}", chi_taxonomy$p.value)
print(chi_taxonomy)
##
## Pearson's Chi-squared test
##
## data: table_taxonomy
## X-squared = 7.312, df = 3, p-value = 0.06259
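The chi-squared warnings above arise from small expected cell counts; Fisher's exact test sidesteps the large-sample approximation. A sketch, with a simulated p-value to keep the 5 x 2 category table tractable:

# Exact tests avoid the chi-squared approximation flagged in the warnings
fisher.test(table_category, simulate.p.value = TRUE, B = 10000)
fisher.test(table_taxonomy)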
# T-tests for correctness between Taxonomy levels
taxonomy_levels <- unique(data$Taxonomy)
t_test_results <- map_dfr(combn(taxonomy_levels, 2, simplify = FALSE), function(pair) {
test <- t.test(Correct_binary ~ Taxonomy, data = filter(data, Taxonomy %in% pair))
tibble(Comparison = paste(pair, collapse = " vs "), t_stat = test$statistic, p_value = test$p.value)
})
print(t_test_results)
## # A tibble: 6 × 3
## Comparison t_stat p_value
## <chr> <dbl> <dbl>
## 1 Understand vs Apply -1.47 0.147
## 2 Understand vs Knowledge -0.792 0.436
## 3 Understand vs Analysis -2.76 0.00768
## 4 Apply vs Knowledge -0.247 0.806
## 5 Apply vs Analysis -1.49 0.140
## 6 Knowledge vs Analysis -1.41 0.167
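With six pairwise comparisons, the unadjusted p-values above are vulnerable to multiple-testing inflation; a Holm correction is a simple guard, sketched with base R's p.adjust:

# Holm-adjusted p-values for the six pairwise taxonomy comparisons
t_test_results %>%
  mutate(p_adjusted = p.adjust(p_value, method = "holm"))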
# Logistic regression with collinearity adjustments
log_info("Running logistic regression with collinearity adjustments.")
data <- data %>% drop_na(Correct_binary)
model_data <- data %>%
dplyr::select(Correct_binary, Category, Taxonomy) %>%
mutate(across(c(Category, Taxonomy), as.factor))
dummies <- model.matrix(~ Category + Taxonomy, data = model_data)[, -1]
df_model <- data.frame(Correct_binary = model_data$Correct_binary, dummies)
# Check for multicollinearity
vif_values <- vif(glm(Correct_binary ~ ., data = df_model, family = binomial))
log_info("VIF values calculated.")
print(vif_values)
## CategoryEpidemiology.and.Biostats CategoryMedical.Management
## 1.000000 3.745479
## CategoryScreening.and.Diagnosis CategorySurgical.Management
## 2.720433 3.427444
## TaxonomyApply TaxonomyKnowledge
## 1.287591 1.256417
## TaxonomyUnderstand
## 1.243537
# Remove high collinearity variables (VIF > 10)
variables_to_keep <- names(vif_values)[vif_values <= 10]
df_model_reduced <- df_model %>% dplyr::select(Correct_binary, all_of(variables_to_keep))
# Run logistic regression again
logit_model <- glm(Correct_binary ~ ., data = df_model_reduced, family = binomial)
logit_summary <- tidy(logit_model)
print(logit_summary)
## # A tibble: 8 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.836 0.894 -0.935 0.350
## 2 CategoryEpidemiology.and.Biostats 17.3 1536. 0.0112 0.991
## 3 CategoryMedical.Management 1.65 0.865 1.91 0.0557
## 4 CategoryScreening.and.Diagnosis 0.913 0.907 1.01 0.314
## 5 CategorySurgical.Management 1.22 0.889 1.37 0.169
## 6 TaxonomyApply 0.913 0.518 1.76 0.0781
## 7 TaxonomyKnowledge 1.27 0.781 1.63 0.103
## 8 TaxonomyUnderstand 1.89 0.739 2.56 0.0104
beepr::beep(2) # Indicate completion
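For interpretation, the log-odds coefficients can be exponentiated into odds ratios; a sketch with broom (the confidence interval for the separated Epidemiology term will be unstable):

# Odds ratios with profile-likelihood confidence intervals
tidy(logit_model, exponentiate = TRUE, conf.int = TRUE)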
These results show the output of a logistic regression examining which factors influence the accuracy of ChatGPT's responses on the PROLOG question set. The model predicts the log-odds of a correct answer from the category and taxonomy variables. A detailed interpretation follows:
The intercept (-0.836) represents the baseline log-odds of a correct answer when all predictor variables are at their reference levels. This negative value suggests that at baseline (likely for “Counseling” category and “Analysis” taxonomy, which appear to be the reference categories), the probability of a correct answer is less than 50%.
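Translating the intercept to a probability makes this concrete; the inverse-logit of -0.836 is about 0.30:

# Inverse-logit of the intercept: baseline probability of a correct answer
plogis(-0.836)  # ~0.30 for the reference Counseling/Analysis cell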
Cognitive Level Effect: The strongest and only clearly significant predictor is the “Understand” taxonomy level, indicating that questions requiring understanding (rather than analysis) are much more likely to be answered correctly.
Category Trends: While not reaching strict statistical significance, there’s a trend suggesting better performance in Medical Management categories compared to the reference.
Epidemiology Statistical Issue: The perfect performance in Epidemiology and Biostats (6/6 correct) produced complete separation in the model, which is why that coefficient has an enormous standard error and is unreliable; a penalized-likelihood remedy is sketched after this list.
Hierarchical Performance: The taxonomy coefficients are broadly ordered Understand > Knowledge > Apply > Analysis (the reference), consistent with performance declining as cognitive complexity increases, although Knowledge and Understand swap places relative to Bloom's ordering.
Model Limitations: Several non-significant predictors suggest the sample size may be insufficient to detect more subtle effects, particularly within categories.
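As noted in the separation issue above, the Epidemiology and Biostats coefficient cannot be estimated reliably by ordinary maximum likelihood. One standard remedy is Firth's bias-reduced logistic regression; a minimal sketch, assuming the logistf package is installed:

# Firth's penalized likelihood keeps coefficients finite under complete separation
library(logistf)
firth_model <- logistf(Correct_binary ~ ., data = df_model_reduced)
summary(firth_model)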
This analysis aligns with the earlier findings that performance varies more notably across cognitive taxonomy levels than across subject categories, with cognitive complexity being a stronger predictor of AI performance than the specific medical domain.