From Menarche to Menopause: Using Generative AI to Explore the Reproductive Life Cycle

Author

Ctrl+Alt+Defeat

Our team, Ctrl+Alt+Defeat, is excited to participate in this year’s Women in Data: Datathon 2024 with a project focused on generative AI and its role in supporting reproductive health decisions. As a team composed of individuals who have experienced menstruation, we understand the critical need for the 1.8 billion people worldwide¹ who menstruate and approximately 1.2 billion people who are menopausal or postmenopausal² to have access to reliable and unbiased information. Our research aims to uncover biases in generative AI and identify fairness gaps affecting inclusivity.

Problem Statement

Generative AI platforms are increasingly used to provide health information, yet there is limited research on how well they address the needs of people who menstruate across the reproductive life cycle. This study examines differences in accessibility, readability, tone and supportiveness, and quality of use case responses across five generative AI platforms. By systematically comparing these platforms, the research aims to uncover disparities in the provision of reproductive health information and identify fairness issues that may disproportionately impact marginalized or underserved groups.

Research Questions

Research Question 1: What are the differences in accessibility options between genAI platforms?

Research Question 2: What are the differences in readability between genAI platforms?

Research Question 3: What are the differences in tone and supportiveness between genAI platforms?

Research Question 4: What are the differences in quality of use case responses across genAI platforms, with quality referring to the effectiveness of recommendations?

Generative AI Platforms

ChatGPT offers human-like conversations across diverse domains, making it ideal for detailed research support. Its adaptability suits various users, from researchers to professionals, providing reliable, in-depth responses.

Claude is known for precision and clarity. It excels at concise answers in multiple languages, making it a go-to for users prioritizing efficiency and straightforward information.

Grok specializes in technical topics, offering thorough, detailed responses on complex subjects. It’s highly valued by academic and professional users seeking deep insights into niche areas.

Pi focuses on emotional intelligence and delivers empathetic and personal interactions. It’s ideal for users seeking meaningful, emotionally nuanced conversations with AI.

Venice fosters open, uncensored communication, allowing unrestricted dialogue. Its inclusivity and transparency appeal to users looking for an unfiltered, authentic experience.

Prompt Categories

Prompts

Research Question 1: Accessibility Options

Methods

Descriptive analysis examined the accessibility features offered by the five AI platforms for individuals with various accessibility needs, including visual, auditory, English as a second language, neurodivergence, and general accommodations.

Results

When all features are totaled, ChatGPT outranks all other platforms for every type of accommodation. ChatGPT has broad accessibility support due to its strong integration with third-party tools and services, making it highly compatible with screen readers and high-contrast modes. It lacks adjustable message display speed and sensory considerations but offers features such as multilingual support.

Pi is consistently ranked second for all types of accommodations. Pi has some advanced features, such as text-to-speech and visual cues, that make it moderately accessible, but it still lacks support for braille displays, closed captioning, and alt text for images.

Grok ranks third for number of features for visual and auditory accommodations. Grok focuses on text-to-speech and visual cues but falls short in terms of broader accessibility features like braille support, customization, and real-time feedback.

Venice ranks third for number of features for English as a second language, neurodivergence, and general accommodations. Venice supports multilingual and simple language interactions but lacks advanced accessibility features like adjustable font sizes, iconography and visual aids.

Implications

Overall, it is essential to consider accessibility in the design and implementation of AI platforms to ensure inclusivity and equal opportunities for individuals with diverse accessibility needs. By integrating comprehensive accessibility features, AI platforms can play a vital role in empowering and supporting individuals, promoting greater participation and engagement, and fostering an inclusive society.

Research Question 2: Readability Analysis

Methods

We analyzed the text complexity of each generated response by calculating its Flesch-Kincaid reading level, which estimates the grade level needed to understand the response. We assessed the ease of comprehension of the text produced by these systems across various use cases using the sylcount R package. Raw response data, including metadata about use cases and AI systems, were imported and pre-processed by removing single and double quotes to ensure clean input. This was essential for accurate readability scoring.
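
As a rough illustration of the scoring step, the sketch below applies the standard Flesch-Kincaid grade-level formula to a made-up response after stripping quotes. It is not the team's script: the study relied on an R package for syllable counting, whereas here the syllable count is supplied by hand, and the function names `fk_grade` and `strip_quotes` and the example text are all hypothetical.

```r
# Flesch-Kincaid grade level from raw counts (standard published formula).
fk_grade <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}

# Pre-processing step described above: remove single and double quotes.
strip_quotes <- function(text) {
  gsub("[\"']", "", text)
}

# Hypothetical example response (not taken from the study data).
response <- "\"Track your cycle.\" It's helpful to note symptoms."
clean <- strip_quotes(response)

# Simple word and sentence counts in base R.
n_words <- lengths(strsplit(clean, "\\s+"))
n_sentences <- lengths(regmatches(clean, gregexpr("[.!?]+", clean)))

# Syllables counted by hand for this sentence (an assumption of the sketch).
n_syllables <- 11

fk_grade(n_words, n_sentences, n_syllables)
```

Longer sentences and more syllables per word both push the grade level up, which is why the score works as a proxy for how demanding a response is to read.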

Readability Analysis

Along with readability, we also measured word count and sentence count to provide insights into the structure and length of the AI-generated text. We analyzed the mean readability across the AI platforms, grouping responses by AI system to determine average Flesch-Kincaid scores. We also compared readability by use case to assess how the context influenced the complexity of generated text.

Results

A one-way ANOVA confirmed significant differences in Flesch-Kincaid grade level across the platforms (F(4, 423) = 13.06, p < .001).

Tukey’s post-hoc tests located these differences: Venice’s Flesch-Kincaid grade levels were significantly higher than those of every other platform (all p < .001). In particular, Venice’s text was considerably more complex than ChatGPT’s, with a mean difference of 4.67 grade levels (p < .001), indicating that Venice’s content was geared toward a more advanced reading level.

No significant readability differences were observed between ChatGPT, Pi, Grok, and Claude, suggesting similar content accessibility between these platforms.
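
The testing sequence described above can be sketched in base R as follows. The data are synthetic and generated only to make the example runnable; they are not the study's scores, and the column names `platform` and `fk_grade` are assumptions. For reference, degrees of freedom of (4, 423) in a one-way ANOVA correspond to five groups and 428 scored observations.

```r
set.seed(42)  # synthetic data only; these are NOT the study's scores

platforms <- c("ChatGPT", "Claude", "Grok", "Pi", "Venice")
scores <- data.frame(
  platform = factor(rep(platforms, each = 20)),
  fk_grade = c(rnorm(80, mean = 10, sd = 2),  # four platforms with similar levels
               rnorm(20, mean = 15, sd = 2))  # one platform with harder text
)

# Omnibus test: does mean grade level differ across platforms?
fit <- aov(fk_grade ~ platform, data = scores)
summary(fit)

# Pairwise follow-up: which platforms differ from which?
tukey <- TukeyHSD(fit)
tukey$platform  # matrix with columns diff, lwr, upr, p adj
```

The ANOVA only says that at least one platform differs; the Tukey table is what identifies the specific pairs while adjusting for the ten comparisons.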

Implications

These findings have critical implications for health communication. Venice, with its higher grade-level scores, may not be suitable for patient-facing materials, which need to be clear and accessible to diverse audiences. Its complex text is better suited for academic or professional use. In contrast, Pi, Claude, Grok, and ChatGPT, with their lower grade-level scores, offer more accessible content and are better aligned with the needs of patient education. This highlights the potential for tailoring AI models to generate content at varying reading levels, depending on the target audience, from patients to healthcare professionals.

Research Question 3: Tone and Supportiveness

Methods

We gathered responses from five generative AI platforms across twelve use-case prompts, each related to reproductive health. The sentiment analysis script, adapted from the LADAL.edu template, was modified to fit our dataset structure. Each response was analyzed at the individual prompt level, allowing for a granular examination of emotional tone across platforms. Chart shown for illustrative purposes only.

Sentiment Analysis

The sentiment analysis measured the prevalence of ten emotions—anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust—across each response. We calculated the polarity ratio by dividing the prevalence of positive words by negative words, with a ratio greater than 1 indicating a more positive overall tone. This analysis was performed for each response rather than combining all responses per platform, enabling us to capture differences in tone by both platform and use case.
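
A minimal sketch of the polarity ratio for a single response is shown below. The five-word lexicon, the example sentence, and the function name `polarity_ratio` are hypothetical and stand in for the full emotion lexicon used in the analysis. Because both prevalences share the same word-count denominator within a response, the ratio of prevalences equals the ratio of raw counts.

```r
# Toy lexicon, purely hypothetical; the study used a full emotion lexicon.
lexicon <- data.frame(
  word      = c("support", "relief", "healthy", "pain", "worry"),
  sentiment = c("positive", "positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

polarity_ratio <- function(text, lexicon) {
  # Lowercase, drop punctuation and digits, split on whitespace.
  cleaned <- tolower(gsub("[^[:alpha:][:space:]]", "", text))
  tokens <- strsplit(cleaned, "\\s+")[[1]]
  # Look up each token; words outside the lexicon become NA.
  hits <- lexicon$sentiment[match(tokens, lexicon$word)]
  n_pos <- sum(hits == "positive", na.rm = TRUE)
  n_neg <- sum(hits == "negative", na.rm = TRUE)
  if (n_neg == 0) return(NA_real_)  # ratio is undefined without negative words
  n_pos / n_neg
}

# Hypothetical response text (not from the study data).
response <- "Gentle exercise can bring relief from pain, and support is healthy."
polarity_ratio(response, lexicon)  # 3 positive hits / 1 negative hit = 3
```

Returning `NA` when a response has no negative words is one possible design choice; how such responses were handled in the study is not stated.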

Results

We compared the polarity ratio (composite across all 10 emotions) using a one-way ANOVA. Although the differences were not statistically significant, likely because of the small sample size and limited power, use case responses did trend toward lower scores than research responses.

Differences in Emotions

We also looked at the mean differences in emotions by platform to identify potential patterns and trends in emotional tones.

Variation in Emotional Responses Across Platforms: Platforms differ noticeably in how strongly they express certain emotions. For example, Pi has a higher representation of anger and negative sentiment compared to other platforms.

Positivity and Trust: ChatGPT, Claude, and Grok show higher levels of positivity and trust, indicating a more empathetic tone for sensitive health topics. Their supportive language creates a comforting experience, making users feel understood and cared for during emotionally charged discussions like reproductive health. This empathetic approach may help foster a sense of safety and reliability in these platforms.

Implications

When an AI responds with warmth and understanding, it can make a world of difference in how people engage with it, especially when discussing personal health concerns. A comforting tone helps people feel heard and safe, encouraging them to be transparent about their symptoms or worries—aspects they might hesitate to share otherwise. This kind of trust allows the AI to respond with more tailored and relevant advice, helping people feel supported and cared for in a way that goes beyond just giving information. It’s about creating a connection that makes people feel comfortable sharing what really matters.

Research Question 4: Quality of Use Case Responses

Methods

This analysis evaluated the quality of symptom management recommendations from five generative AI platforms across six dimensions: accuracy, relevance to life stage, clarity, actionability, comprehensiveness, and tone and supportiveness.

Rating Scale Development

A standardized 1-to-5 rating scale was developed to assess the effectiveness of each platform, from inaccurate or irrelevant responses to highly accurate, comprehensive, and empathetic recommendations. Three subject matter experts (SMEs) rated responses to 12 unique prompts using these six criteria. Each generative AI platform was rated across all prompts, resulting in 360 ratings. This approach provided a comprehensive comparison of the quality and effectiveness of recommendations provided by each platform.
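
One way to read the reported total is as one rating per prompt, platform, and criterion, which the sketch below enumerates. That reading is an assumption: the text does not say how the three SMEs' individual scores enter the count. The prompt labels and short criterion names are hypothetical.

```r
prompts   <- paste0("prompt_", 1:12)  # hypothetical labels for the 12 prompts
platforms <- c("ChatGPT", "Claude", "Grok", "Pi", "Venice")
criteria  <- c("accuracy", "relevance", "clarity",
               "actionability", "comprehensiveness", "tone_supportiveness")

# One row per prompt x platform x criterion combination.
rating_grid <- expand.grid(
  prompt = prompts,
  platform = platforms,
  criterion = criteria,
  stringsAsFactors = FALSE
)

nrow(rating_grid)  # 12 * 5 * 6 = 360
```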

Quality Analysis

To ensure consistency in evaluations, we calculated inter-rater reliability among the SMEs, which ranged from 0.5 to 1.0 (indicating moderate to strong agreement). Following this, statistical analyses, including significance testing, were performed to identify differences in response quality across age groups. By analyzing variations in ratings for each dimension, we aimed to pinpoint strengths and weaknesses in each system’s ability to offer accurate, relevant, and actionable health recommendations for various age groups.

Results by Age

Accuracy steadily decreases with age. Actionability starts highest for teens and drops across age groups. Clarity increases with age, peaking for the 50+ group. Comprehensiveness declines consistently as age increases. Relevance remains fairly consistent, with only a slight drop in the 50+ group. Supportiveness shows a fluctuating trend, peaking in the 20s group but dropping off afterward.

Implications

The trends suggest potential age-related biases in AI systems, with declining accuracy, actionability, and comprehensiveness for older users. This indicates that older individuals may receive lower-quality information, possibly due to training data not reflecting their needs. Additionally, reduced supportiveness for older users highlights a lack of empathy. Addressing these biases is essential to ensure fairness and improve AI-driven support across all age groups.

Bias and Fairness

Our study highlights bias and fairness issues in AI’s handling of sensitive health topics, with notable differences across platforms. Some AI systems reflect biases in tone, assumptions, and text complexity, potentially misrepresenting diverse lived experiences.

Key findings include:

Cisnormative assumptions: AI often assumes users are cisgender, likely excluding transgender men and non-binary individuals.
Potential age bias: Older adults receive more neutral-toned responses than younger users.
Language challenges: Non-native English speakers may struggle with higher text complexity.
Underrepresentation: These biases may stem from under-representation in AI training, limiting diversity in responses.

The Path Forward

Our path forward emphasizes the need for more inclusive and empathetic AI development.

Key recommendations include:

Inclusive training data: Involve LGBTQIA+ communities, older adults, and underrepresented communities in AI model training.
Gender-neutral language: Refine AI models to recognize identity cues and use gender-neutral language.
Culturally sensitive advice: Ensure AI systems provide advice that respects diverse cultural contexts.
Ongoing evaluation: Regular audits, user feedback, fairness metrics, and community involvement are crucial for fostering empathy, inclusivity, and health equity in AI.

Acknowledgments

Design

We would like to thank Ann Martin from Mind for Media for generously designing our team logo and Microsoft Teams background, donating her time and talents to support our efforts.

All images were created with the assistance of DALL·E 3 (2024).

Software and R Package Citations

Allaire, J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2024). quarto: R Interface to ‘Quarto’ Markdown Publishing System. https://CRAN.R-project.org/package=quarto

Bliese, P., Chen, G., Downes, P., Schepker, D., & Lang, J. (2022). multilevel: Multilevel Functions. R package version 2.7. https://CRAN.R-project.org/package=multilevel

Bray, A. (2023). gg-close-read: A Quarto Extension for Close Reading. University of California, Berkeley. https://github.com/andrewpbray/gg-close-read

Müller, K., & Wickham, H. (2023). tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble

Revelle, W. (2024). psych: Procedures for Psychological, Psychometric, and Personality Research. https://CRAN.R-project.org/package=psych

Rinker, T. W. (2023). readability: Readability Scores. https://github.com/trinker/readability

Schauberger, P., & Walker, A. (2023). openxlsx: Read, Write and Edit xlsx Files. https://ycphs.github.io/openxlsx/

Schweinberger, M. (2022). Sentiment analysis in R. https://ladal.edu.au/sentiment.html

Silge, J., & Robinson, D. (2024). tidytext: Text Mining using ‘dplyr’, ‘ggplot2’, and Other Tidy Tools. https://CRAN.R-project.org/package=tidytext

Wickham, H. (2023). stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr

Wickham, H., Vaughan, D., & Girlich, M. (2024). tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr

Footnotes

1. Rohatgi, A., & Dash, S. (2023, March 1). Period poverty and mental health of menstruators during COVID-19 pandemic: Lessons and implications for the future. Frontiers in Global Women’s Health. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10014781/#:~:text=Around%201.8%20billion%20people%20menstruate,26%25%20of%20the%20global%20population.
2. Hill, K. (1996). The demography of menopause in 2030. Maturitas, 23(2), 113–127. https://doi.org/10.1016/0378-5122(95)00968-x

--- title: "From Menarche to Menopause: Using Generative AI to Explore the Reproductive Life Cycle" author: Ctrl+Alt+Defeat format: closeread-html: css: ctrl_alt_defeat4.css remove-header-space: true code-tools: true embed-resources: true root: "C:/Users/shind/Desktop/R Projects" --- # From Menarche to Menopause: Using Generative AI to Explore the Reproductive Life Cycle :::{#cr-logo} ![](cover_transparent.png){height=500px} ::: :::{.epigraph} > Our team, Ctrl+Alt+Defeat, is excited to participate in this year’s [**Women in Data: Datathon 2024**](https://womenindata.mn.co/posts/datathon-2024-datathon-faq-2024){style="color:#4B3F72;"} with a project focused on generative AI and its role in supporting reproductive health decisions. As a team composed of individuals who have experienced menstruation, we understand the critical need for the 1.8 billion people worldwide[^text] who menstruate and approximately 1.2 billion people who are menopausal or postmenopausal[^text2] to have access to reliable and unbiased information. Our research aims to uncover biases in generative AI and identify fairness gaps affecting inclusivity. ::: :::{.cr-section} :::{style="padding-block: 20svh"} ::: :::{#cr-map} ![](abstract_anatomy.png) ::: ## Problem Statement Generative AI platforms are increasingly used to provide health information, yet there is limited research on how well they address the needs of people who menstruate across the reproductive life cycle. This study examines differences in accessibility, readability, tone and supportiveness, and quality of use case responses across five generative AI platforms. By systematically comparing these platforms, the research aims to uncover disparities in the provision of reproductive health information and identify fairness issues that may disproportionately impact marginalized or underserved groups. 
[@cr-map] ::: ## Research Questions [**Research Question 1:**]{style="color:#4B3F72;"} What are the differences in [*accessibility options*]{style="color:#4B3F72;"} between genAI platforms? [**Research Question 2:**]{style="color:#4B3F72;"} What are the differences in [*readability*]{style="color:#4B3F72;"} between genAI platforms? [**Research Question 3:**]{style="color:#4B3F72;"} What are the differences in [*tone and supportiveness*]{style="color:#4B3F72;"} between genAI platforms? [**Research Question 4:**]{style="color:#4B3F72;"} What are the differences in [*quality of use case responses*]{style="color:#4B3F72;"} across genAI platforms, with quality referring to the effectiveness of recommendations? :::{.cr-section} :::{style="padding-block: 20svh"} ::: ## Generative AI Platforms :::{#cr-gpt} ![](ChatGPT-Logo.png) ::: **ChatGPT** offers human-like conversations across diverse domains, making it ideal for detailed research support. Its adaptability suits various users, from researchers to professionals, providing reliable, in-depth responses. [@cr-gpt] :::{#cr-claude} ![](Claude_AI.png) ::: **Claude** is known for precision and clarity. It excels at concise answers in multiple languages, making it a go-to for users prioritizing efficiency and straightforward information. [@cr-claude] :::{#cr-grok} ![](Grok.png) ::: **Grok** specializes in technical topics, offering thorough, detailed responses on complex subjects. It's highly valued by academic and professional users seeking deep insights into niche areas. [@cr-grok] :::{#cr-pi} ![](Pi.png) ::: **Pi** focuses on emotional intelligence and delivers empathetic and personal interactions. It’s ideal for users seeking meaningful, emotionally nuanced conversations with AI. [@cr-pi] :::{#cr-venice} ![](venice.png) ::: **Venice** fosters open, uncensored communication, allowing unrestricted dialogue. Its inclusivity and transparency appeal to users looking for an unfiltered, authentic experience.
[@cr-venice] ::: ## Prompt Categories :::{#cr-prompt} ![](prompt_cat.png) ::: ## Prompts ```{r, echo = FALSE, warning=FALSE, message=FALSE} library(DT) library(readxl) Prompts <- read_excel("C:/Users/shind/Desktop/Ctrl + Alt + Defeat/Prompts.xlsx") datatable(Prompts, options = list(pageLength = 5, autoWidth = TRUE)) ``` ## Research Question 1: Accessibility Options ### Methods Descriptive analysis examined the accessibility features offered by the five AI platforms for individuals with various accessibility needs, including **visual**, **auditory**, **English as a second language**, **neurodivergence**, and **general accommodations**. [@cr-acc] ```{r, echo = FALSE, warning = FALSE, message = FALSE} library(DT) # Create a data frame for the table content accommodations_data <- data.frame( Feature = c("Screen Reader Compatibility", "High Contrast Mode", "Adjustable Font Sizes", "Braille Display Support", "Text-to-Speech", "Visual Cues for Notifications", "Closed Captioning for Video Calls", "Simple, Clear Language", "Multilingual Support", "Iconography and Visual Aids", "Predictable Layout and Flow", "Customization Options", "Clear, Literal Language", "Timed Messages", "Sensory Considerations", "Alt Text for Images", "Speech-to-Text", "Real-Time Feedback for Typing", "Adjustable Message Display Speed", "Multiple Communication Channels"), ChatGPT = c("✓", "✓", "✓", "✓", "✓", "✓", "✓", "✓", "✓", "✓", "✓", "✓", "✓", "✓", "", "✓", "✓", "✓", "", "✓"), Claude = c("✓", "✓", "✓", "", "✓", "✓", "✓", "✓", "✓", "✓", "✓", "", "✓", "", "", "", "✓", "", "", "✓"), Grok = c("✓", "✓", "", "", "", "✓", "✓", "✓", "", "", "✓", "", "✓", "", "", "", "", "", "", ""), Pi = c("✓", "✓", "✓", "", "", "✓", "", "✓", "✓", "✓", "✓", "", "✓", "", "", "", "", "", "", "✓"), Venice = c("✓", "✓", "✓", "", "", "✓", "", "✓", "✓", "", "✓", "", "✓", "", "", "", "", "", "", "✓") ) # Sort the accommodations_data by the Feature column in ascending order accommodations_data_sorted <- 
accommodations_data[order(accommodations_data$Feature), ] # Create the datatable with sorted Feature column datatable(accommodations_data_sorted, options = list(pageLength = 5, autoWidth = TRUE), rownames = FALSE, escape = FALSE) ``` :::{.cr-section} :::{style="padding-block: 20svh"} ::: :::{#cr-bump1} ![](bump1.png) ::: :::{#cr-bump2} ![](bump2.png) ::: :::{#cr-bump3} ![](bump3.png) ::: :::{#cr-bump4} ![](bump4.png) ::: :::{#cr-dot} ![](dot_plot.png) ::: :::{#cr-accesibility} ![](accessibility.png) ::: :::{#cr-reading} ![](reading_imp.png) ::: :::{#cr-readinglvl} ```{r, warning=FALSE, echo=FALSE, message=FALSE} # Library library(DT) # Create the data for the table data <- data.frame( `Flesch-Kincaid Grade Level` = c("0-1", "1-5", "5-11", "11-18"), `School level` = c("Pre-kindergarten - 1st grade", "1st grade - 5th grade", "5th grade - 11th grade", "11th grade - 18th grade"), `Student age range` = c("3-7", "7-11", "11-17", "17 and above") ) # Use the datatable() function from DT package datatable(data, rownames = FALSE, options = list(pageLength = 5, autoWidth = TRUE), colnames = c('Flesch-Kincaid Grade Level', 'School Level', 'Student Age Range')) ``` ::: ### Results When totaling all features, [**ChatGPT**]{style="color:#4B3F72;"} outranks all other platforms for all types of accommodations. ChatGPT has broad accessibility support due to its strong integration with third-party tools and services, making it highly compatible with screen readers and high-contrast modes. It lacks adjustable message display speed and sensory considerations but offers features such as Multilingual support. [@cr-bump1] [**Pi**]{style="color:#4B3F72;"} is consistently ranked second for all types of accommodations. Pi has some advanced features like text-to-speech and visual cues, making it moderately accessible but still lacks support for braille displays, closed captioning, and alt text for images. 
[@cr-bump2] [**Grok**]{style="color:#4B3F72;"} ranks third for number of features for visual and auditory accommodations. Grok focuses on text-to-speech and visual cues but falls short in terms of broader accessibility features like braille support, customization, and real-time feedback. [@cr-bump3] [**Venice**]{style="color:#4B3F72;"} ranks third for number of features for English as a second language, neurodivergence, and general accommodations. Venice supports multilingual and simple language interactions but lacks advanced accessibility features like adjustable font sizes, iconography and visual aids. [@cr-bump4] ### Implications Overall, it is essential to consider accessibility in the design and implementation of AI platforms to ensure inclusivity and equal opportunities for individuals with diverse accessibility needs. By integrating comprehensive accessibility features, AI platforms can play a vital role in empowering and supporting individuals, promoting greater participation and engagement, and fostering an inclusive society. [@cr-accesibility] ## Research Question 2: Readability Analysis ### Methods We analyzed the text complexity of each generated response by calculating its **Flesch-Kincaid** reading level, which estimates the grade level needed to understand the response. We assessed the ease of comprehension of the text produced by these systems across various use cases using the sylcount R package. Raw response data, including metadata about use cases and AI systems, were imported and pre-processed by removing single and double quotes to ensure clean input. This was essential for accurate readability scoring. [@cr-readinglvl] ### Readability Analysis Along with readability, we also measured word count and sentence count to provide insights into the structure and length of the AI-generated text. We analyzed the mean readability across the AI platforms, grouping responses by AI system to determine average Flesch-Kincaid scores.
We also compared readability by use case to assess how the context influenced the complexity of generated text. [@cr-readinglvl] ### Results A one-way ANOVA test confirmed [**significant differences**]{style="color:#4B3F72;"} across the platforms (F(4, 423) = 13.06, p < .001). [@cr-dot] Tukey's post-hoc tests further supported these findings (all p < .001). Specifically, [**Venice**]{style="color:#4B3F72;"} showed significantly higher readability levels compared to all other platforms. In particular, Venice's readability was much more complex than ChatGPT, with a mean difference of 4.67 (p < .001), indicating that Venice's content was geared toward a more advanced reading level. [@cr-dot]{pan-to="-40%,30%" scale-by="1.5"} **Non-significant** readability differences were observed between ChatGPT, Pi, Grok, and Claude, suggesting similar content accessibility between these platforms. [@cr-dot]{pan-to="-40%,-20%" scale-by="1.5"} ### Implications These findings have critical implications for health communication. Venice, with its higher readability scores, may not be suitable for patient-facing materials, which need to be clear and accessible to diverse audiences. Its complex text is better suited for academic or professional use. In contrast, Pi, Claude, Grok, and ChatGPT, with their lower readability scores, offer more accessible content and are better aligned with the needs of patient education. This highlights the potential for tailoring AI models to generate content at varying reading levels, depending on the target audience, from patients to healthcare professionals. [@cr-reading] ## Research Question 3: Tone and Supportiveness :::{#cr-polarity} ![](polarity_ratio.png) ::: :::{#cr-polarity_comp} ![](polarity_comp.png) ::: :::{#cr-rainbow2} ![](rainbow2.png) ::: ### Methods We gathered responses from five generative AI platforms across twelve use-case prompts, each related to reproductive health. 
The sentiment analysis script, adapted from the LADAL.edu template, was modified to fit our dataset structure. Each response was analyzed at the individual prompt level, allowing for a granular examination of emotional tone across platforms. *Chart shown for illustrative purposes only.* [@cr-polarity] ### Sentiment Analysis The sentiment analysis measured the prevalence of ten emotions—**anger**, **anticipation**, **disgust**, **fear**, **joy**, **negative**, **positive**, **sadness**, **surprise**, and **trust**—across each response. We calculated the polarity ratio by dividing the prevalence of positive words by negative words, with a ratio greater than 1 indicating a more positive overall tone. This analysis was performed for each response rather than combining all responses per platform, enabling us to capture differences in tone by both platform and use case. ### Results :::{#cr-multiple} ```{r, echo = FALSE, warning=FALSE, message=FALSE} # Load required libraries library(ggplot2) library(plotly) library(dplyr) # Provided data data <- data.frame( Platform = c("ChatGPT", "Claude", "Grok", "PI", "Venice"), Anger = c(2.23, 2.08, 2.14, 3.51, 3.07), Anticipation = c(3.37, 4.37, 3.77, 4.05, 4.26), Disgust = c(2.26, 2.18, 2.46, 4.91, 5.09), Fear = c(4.21, 3.83, 4.30, 6.08, 5.82), Joy = c(3.63, 4.53, 3.79, 6.17, 7.68), Negative = c(8.31, 7.58, 7.67, 9.48, 8.63), Positive = c(12.87, 13.30, 13.11, 10.64, 10.13), Sadness = c(3.94, 3.35, 4.01, 5.22, 5.20), Surprise = c(1.85, 2.66, 2.13, 4.61, 4.30), Trust = c(5.70, 5.99, 6.39, 4.04, 5.93) ) # Hex colors for emotions emotion_colors <- c( Anger = "#D32F2F", Anticipation = "#F57C00", Disgust = "#355E3B", Fear = "#455A64", Joy = "#FFEB3B", Negative = "#B71C1C", Positive = "#4CAF50", Sadness = "#1E88E5", Surprise = "#FF9800", Trust = "#43A047" ) # Reshape the data for ggplot2 data_melted <- reshape2::melt(data, id.vars = "Platform", variable.name = "Emotion", value.name = "Average") # Add text aesthetic for tooltips p <- 
ggplot(data_melted, aes(x = reorder(Platform, -Average), y = Average, fill = Emotion, text = paste("Platform: ", Platform, "<br>Emotion: ", Emotion, "<br>Average: ", round(Average, 2)))) + geom_bar(stat = "identity") + facet_wrap(~ Emotion, ncol = 4) + scale_fill_manual(values = emotion_colors) + coord_flip() + labs(caption = "", size = 2, x="", y = "") + theme_minimal() + theme(plot.title = element_text(size = 18), plot.subtitle = element_text(size = 16), plot.caption = element_text(size = 14), axis.title = element_text(size = 14), axis.text = element_text(size = 14), strip.text = element_text(size = 14, face = "bold"), axis.text.x = element_blank()) + theme(panel.background = element_rect(fill = "#f4f4f6", color = NA), plot.background = element_rect(fill = "#f4f4f6", color = NA), legend.background = element_rect(fill = "#f4f4f6", color = NA), legend.position = "none") + theme(panel.spacing = unit(1.5, "lines")) # Convert to plotly with tooltip set to display the 'text' aesthetic ggplotly(p, tooltip = "text") %>% layout(width = 900, height = 450) ``` ::: We compared the polarity ratio (composite across all 10 emotions) using a one-way ANOVA. Although non-significant differences were found, likely due to a small sample size and limited power, use case responses did trend towards lower scores compared with research responses. [@cr-polarity_comp] ### Differences in Emotions We also looked at the mean differences in emotions by platform to identify potential patterns and trends in emotional tones. [@cr-multiple] **Variation in Emotional Responses Across Platforms:** Each platform displays noticeable differences in how strongly they express certain emotions. For example, Pi has a higher representation of [**anger**]{style="color:#D32F2F;"} and [**negative sentiment**]{style="color:#B71C1C;"} compared to other platforms. 
[@cr-multiple]{pan-to="40%,40%" scale-by="2.0"} **Positivity and Trust:** ChatGPT, Claude, and Grok show higher levels of [**positivity**]{style="color:#4CAF50;"} and [**trust**]{style="color:#43A047;"}, indicating a more empathetic tone for sensitive health topics. Their supportive language creates a comforting experience, making users feel understood and cared for during emotionally charged discussions like reproductive health. This empathetic approach may help foster a sense of safety and reliability in these platforms. [@cr-multiple]{pan-to="-10%,-10%" scale-by="1.5"} ### Implications When an AI responds with warmth and understanding, it can make a world of difference in how people engage with it, especially when discussing personal health concerns. A comforting tone helps people feel heard and safe, encouraging them to be transparent about their symptoms or worries, aspects they might hesitate to share otherwise. This kind of trust allows the AI to respond with more tailored and relevant advice, helping people feel supported and cared for in a way that goes beyond just giving information. It’s about creating a connection that makes people feel comfortable sharing what really matters.
[@cr-rainbow2]

## Research Question 4: Quality of Use Case Responses

:::{#cr-scale}

```{r, echo = FALSE, warning=FALSE, message=FALSE}
# Install the DT package only if it is not already installed, then load it
if (!requireNamespace("DT", quietly = TRUE)) install.packages("DT")
library(DT)

# Create a data frame for the evaluation criteria:
# each category row holds the guiding question, followed by its five scale anchors
evaluation_data <- data.frame(
  Scale_Category = c(
    "Accuracy", "", "", "", "", "",
    "Relevance to Specific Stage of Life", "", "", "", "", "",
    "Clarity", "", "", "", "", "",
    "Actionability", "", "", "", "", "",
    "Comprehensiveness", "", "", "", "", "",
    "Tone and Supportiveness", "", "", "", "", ""
  ),
  Scale = c(
    "How accurate is the information provided by the AI in relation to current medical knowledge?",
    "1: The response contains factual errors or misinformation.",
    "2: The response is partially accurate but misses key details.",
    "3: The response is mostly accurate but lacks depth or minor updates.",
    "4: The response is accurate and provides relevant details.",
    "5: The response is entirely accurate and reflects current best practices.",
    "How well does the AI’s response address the specific symptoms and concerns related to the user's life stage?",
    "1: The response does not relate to the age group or symptoms described.",
    "2: The response partially addresses the user’s concerns but includes irrelevant details.",
    "3: The response addresses most of the user’s concerns but lacks specificity.",
    "4: The response is relevant to the user’s age group and their specific symptoms.",
    "5: The response is highly relevant, personalized to the user’s life stage and concerns.",
    "How clear and understandable is the AI’s response for the user to comprehend and apply?",
    "1: The response is confusing or overly complex.",
    "2: The response is somewhat clear but requires further clarification.",
    "3: The response is mostly clear but may include some complex terms or concepts.",
    "4: The response is clear and easy to understand.",
    "5: The response is exceptionally clear and straightforward for any user.",
    "Does the AI provide practical, actionable steps for managing the symptoms?",
    "1: The response provides no actionable advice.",
    "2: The response provides minimal or vague advice.",
    "3: The response provides some actionable advice, but it may be too general.",
    "4: The response provides clear and practical steps for managing symptoms.",
    "5: The response provides highly detailed and practical steps that are immediately applicable.",
    "How thoroughly does the AI address the full scope of the user's query?",
    "1: The response is superficial and fails to cover major points.",
    "2: The response covers only a small portion of the user's concerns.",
    "3: The response is moderately comprehensive but leaves out important details.",
    "4: The response is detailed and covers most aspects of the question.",
    "5: The response is highly comprehensive and covers all relevant aspects.",
    "Does the AI’s response have an empathetic tone and provide a supportive approach?",
    "1: The response is cold, unsupportive, or dismissive.",
    "2: The response is somewhat neutral with limited support.",
    "3: The response shows some empathy but could be more supportive.",
    "4: The response is supportive and understanding.",
    "5: The response is highly empathetic and comforting."
  ),
  stringsAsFactors = FALSE
)

# Create the datatable (six rows per page, so each page shows one category)
datatable(evaluation_data,
          options = list(pageLength = 6, autoWidth = TRUE),
          rownames = FALSE,
          colnames = c('Scale Category', 'Scale'))
```

:::

### Methods

This analysis evaluated the quality of symptom management recommendations from five generative AI platforms across six dimensions: accuracy, relevance to life stage, clarity, actionability, comprehensiveness, and tone and supportiveness.[@cr-scale]

### Rating Scale Development

A standardized [**1-to-5**]{style="color:#4B3F72;"} rating scale was developed to assess the effectiveness of each platform, with scores ranging from inaccurate or irrelevant responses to highly accurate, comprehensive, and empathetic recommendations.
Three subject matter experts (SMEs) rated responses to 12 unique prompts using these six criteria. Each generative AI platform was rated across all prompts, resulting in 360 ratings. This approach provided a comprehensive comparison of the quality and effectiveness of recommendations provided by each platform.[@cr-scale]

### Quality Analysis

To ensure consistency in evaluations, we calculated inter-rater reliability among the SMEs, which ranged from 0.5 to 1.0 (indicating moderate to strong agreement). Following this, statistical analyses, including significance testing, were performed to identify differences in response quality across age groups. By analyzing variations in ratings for each dimension, we aimed to pinpoint strengths and weaknesses in each system's ability to offer accurate, relevant, and actionable health recommendations for various age groups.

:::

:::{.cr-section}

:::{style="padding-block: 20svh"}
:::

:::{#cr-age_imp}
![](age_imp.png)
:::

### Results by Age

:::{#cr-age}

```{r, echo = FALSE, warning=FALSE, message=FALSE, fig.width= 10, fig.height= 8}
# Required packages
library(ggplot2)
library(tidyr)
library(plotly)

# Mean ratings by age group for each quality dimension
df <- data.frame(
  Age_Group = factor(c("Teens", "20s", "30s", "40s", "50+"),
                     levels = c("Teens", "20s", "30s", "40s", "50+")), # Reordering Age_Group factor
  Accuracy = c(4.26, 4.18, 3.93, 3.88, 3.88),
  Relevance = c(4.21, 4.23, 4.22, 4.23, 4.14),
  Clarity = c(3.88, 4.12, 4.06, 4.07, 4.35),
  Actionability = c(4.62, 4.36, 4.33, 4.19, 4.18),
  Comprehensiveness = c(4.46, 4.44, 4.24, 4.12, 3.96),
  Supportiveness = c(3.19, 3.4, 2.85, 3.05, 2.71)
)

# Reshape to long format: one row per age group and quality dimension
df_long <- gather(df, key = "Quality_Dimension", value = "Mean_Rating", -Age_Group)

# Line plot of mean rating by age group, one facet per quality dimension
p <- ggplot(df_long, aes(x = Age_Group, y = Mean_Rating, group = Quality_Dimension, color = Quality_Dimension)) +
  geom_line(color = "#4B3F72", size = 1.0) +
  geom_point(color = "#4B3F72", size = 1.0) +
  facet_wrap(~ Quality_Dimension) +
  labs(x = "Age Group", y = "Mean Rating") +
  scale_y_continuous(limits = c(1, 5), breaks = c(1, 2, 3, 4, 5)) +
  theme_minimal() +
  theme(legend.position = "none",
        strip.text = element_text(size = 10),
        axis.text.x = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "transparent", color = NA),
        plot.background = element_rect(fill = "transparent", color = NA),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

# Convert to plotly for responsive container fitting
interactive_plot <- ggplotly(p)

# Display the interactive plot
interactive_plot
```

:::

**Accuracy** decreases with age, from 4.26 for teens to 3.88 for the 40s and 50+ groups. **Actionability** starts highest for teens (4.62) and drops across age groups to 4.18 for the 50+ group. **Clarity** generally increases with age, peaking for the 50+ group (4.35). **Comprehensiveness** declines consistently as age increases, from 4.46 to 3.96. **Relevance** remains fairly consistent (4.21 to 4.23), with only a slight drop in the 50+ group (4.14). **Supportiveness** fluctuates, peaking in the 20s group (3.40) and falling to its lowest value in the 50+ group (2.71).[@cr-age]

### Implications

The trends suggest potential age-related biases in AI systems, with declining accuracy, actionability, and comprehensiveness for older users. This indicates that older individuals may receive lower-quality information, possibly due to training data not reflecting their needs. Additionally, the lower supportiveness ratings for older users point to less empathetic responses for this group. Addressing these biases is essential to ensure fairness and improve AI-driven support across all age groups.[@cr-age_imp]

:::

## Bias and Fairness

Our study highlights bias and fairness issues in AI's handling of sensitive health topics, with notable differences across platforms. Some AI systems reflect biases in tone, assumptions, and text complexity, potentially misrepresenting diverse lived experiences. **Key findings include:**

- **Cisnormative assumptions:** AI often assumes users are cisgender, which likely excludes transgender men and non-binary individuals.
- **Potential age bias:** Older adults receive more neutral-toned responses than younger users.
- **Language challenges:** Non-native English speakers may struggle with higher text complexity.
- **Underrepresentation:** These biases may stem from underrepresentation in AI training data, limiting diversity in responses.

## The Path Forward

Our path forward emphasizes the need for more inclusive and empathetic AI development. **Key recommendations include:**

- **Inclusive training data:** Involve LGBTQIA+ communities, older adults, and other underrepresented communities in AI model training.
- **Gender-neutral language:** Refine AI models to recognize identity cues and use gender-neutral language.
- **Culturally sensitive advice:** Ensure AI systems provide advice that respects diverse cultural contexts.
- **Ongoing evaluation:** Regular audits, user feedback, fairness metrics, and community involvement are crucial for fostering empathy, inclusivity, and health equity in AI.

## Acknowledgments

### Design

We would like to thank Ann Martin from [Mind for Media](https://www.mindformedia.com/) for generously designing our team logo and Microsoft Teams background, contributing her talents on a not-for-profit basis to support our efforts. All images were created with the assistance of DALL·E 3 (2024).

### Software and R Package Citations

Allaire, J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2024). quarto: R Interface to 'Quarto' Markdown Publishing System. https://CRAN.R-project.org/package=quarto

Bliese, P., Chen, G., Downes, P., Schepker, D., & Lang, J. (2022). multilevel: Multilevel Functions. R package version 2.7. https://CRAN.R-project.org/package=multilevel

Bray, A. (2023). gg-close-read: A Quarto Extension for Close Reading. University of California, Berkeley. https://github.com/andrewpbray/gg-close-read

Müller, K., & Wickham, H. (2023). tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble

Revelle, W. (2024). psych: Procedures for Psychological, Psychometric, and Personality Research. https://CRAN.R-project.org/package=psych

Rinker, T. W. (2023). readability: Readability Scores. https://github.com/trinker/readability

Schauberger, P., & Walker, A. (2023). openxlsx: Read, Write and Edit xlsx Files. https://ycphs.github.io/openxlsx/

Schweinberger, M. (2022). Sentiment analysis in R. https://ladal.edu.au/sentiment.html

Silge, J., & Robinson, D. (2024). tidytext: Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools. https://CRAN.R-project.org/package=tidytext

Wickham, H. (2023). stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr

Wickham, H., Vaughan, D., & Girlich, M. (2024). tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr

[^text]: Rohatgi, A., & Dash, S. (2023, March 1). *Period poverty and mental health of menstruators during COVID-19 pandemic: Lessons and implications for the future.* Frontiers in Global Women’s Health. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10014781/

[^text2]: Hill, K. (1996). The demography of menopause in 2030. Maturitas, 23(2), 113–127. https://doi.org/10.1016/0378-5122(95)00968-x