Using Data to Predict and Understand PCOS Symptoms
Introduction
The goal of this project is to build a machine learning model to predict whether an individual has polycystic ovary syndrome (PCOS). Using a dataset from Kaggle, we will explore and compare different machine learning techniques to identify the most accurate model for this binary classification problem.
What is PCOS?
For those who haven’t heard of polycystic ovary syndrome (PCOS), it’s a hormonal disorder that affects women during their childbearing years. When the body doesn’t produce the hormones needed for ovulation, the ovaries can’t release eggs properly. This can cause cysts to form on the ovaries and lead to an overproduction of androgens, which are male hormones. High androgen levels in women can cause many symptoms, including missed periods, acne, excess facial or body hair, weight gain, fatigue, infertility, headaches, and mood changes. PCOS affects women both physically and emotionally and can lead to long-term health issues if not properly managed.
According to the National Institutes of Health, the official criteria for diagnosing PCOS in adults require that at least two of the following conditions be met:
Absent or irregular menstrual cycles
High levels of androgens not caused by other conditions
Growth of follicles of a specific size in at least one of the ovaries.
While the exact cause of PCOS is still unclear, both genetic and environmental factors are believed to play a role. Women with a family history of PCOS are more likely to develop the condition, which suggests a genetic link. In addition, poor diet, lack of exercise, and obesity can worsen symptoms or even trigger the condition. Researchers and doctors continue working to better understand the underlying causes of PCOS and how to treat it.
Inspiration and Motive
I decided to take on this project because I was recently diagnosed with PCOS after years of experiencing unexplained symptoms that were often brushed off. That experience made me wonder how many others might be going through the same thing without knowing the reason behind it. Even though polycystic ovary syndrome is the most common hormonal disorder among women of childbearing age, studies show that up to 75 percent of people with it go undiagnosed, usually not finding out until they try to have children.
Through this project, I wanted to explore how technology, specifically machine learning, can be used to improve women’s health and help people better understand their symptoms. By creating a model that predicts PCOS, those in similar situations could compare their own data and get a clearer idea of whether they might have the condition. Ultimately, my goal is to raise awareness about PCOS and support people who may have been overlooked in the past.
Exploratory Data Analysis
The dataset we’re using to predict whether someone has PCOS comes from Kaggle and includes patient data from 10 hospitals across India. We’ll be combining two separate datasets, one with infertility data and one without. After merging them, we’ll be left with 541 observations and 50 variables to explore.
Describing the Predictors
For each relevant column, 1 indicates “yes” and 0 indicates “no.”
patient_file_no
- The file number used to identify each patient.
pcos_y_n_x
- Whether or not the patient has been diagnosed with PCOS (the outcome we’re trying to predict).
age_yrs
- Patient’s age in years.
weight_kg
- Patient’s weight in kilograms.
height_cm
- Patient’s height in centimeters.
bmi
- Patient’s body mass index, calculated from height and weight.
blood_group
- Patient’s blood type, coded as: A+ = 11, A- = 12, B+ = 13, B- = 14, O+ = 15, O- = 16, AB+ = 17, AB- = 18.
pulse_rate_bpm
- Patient’s heart rate in beats per minute. A normal resting heart rate for women is usually between 60 and 100 bpm.
rr_breaths_min
- Respiratory rate: how many breaths the patient takes per minute. The normal range for women is 12 to 18.
hb_g_dl
- Hemoglobin level (grams per deciliter), which measures the protein in red blood cells that carries oxygen. The normal range for women is roughly 12 to 15 g/dL.
cycle_r_i
- Whether the patient’s menstrual cycle is regular or irregular, coded as: 2 = regular, 4 = irregular.
cycle_length_days
- The number of days the patient’s menstrual cycle lasts.
marraige_status_yrs
- Number of years the patient has been married.
pregnant_y_n
- Whether the patient is pregnant.
no_of_aborptions
- Number of abortions, if any.
i_beta_hcg_m_iu_m_l_x
- First beta-hCG reading, which helps identify pregnancy. Less than 5 is negative; over 25 is positive.
ii_beta_hcg_m_iu_m_l_x
- Second beta-hCG reading.
fsh_m_iu_m_l
- Follicle-stimulating hormone level. Normal ranges: before puberty, 0–4.0; during puberty, 0.3–10.0; menstruating women, 4.7–21.5; after menopause, 25.8–134.8.
lh_m_iu_m_l
- Luteinizing hormone, which plays a crucial role in regulating the reproductive system.
hip_inch
- Hip size in inches.
waist_inch
- Waist size in inches.
tsh_m_iu_l
- Thyroid-stimulating hormone, which indicates whether the thyroid gland is functioning properly.
amh_ng_m_l_x
- Anti-Müllerian hormone, which reflects the egg supply in the ovaries.
prl_ng_m_l
- Prolactin level; prolactin supports milk production after childbirth.
vit_d3_ng_m_l
- Vitamin D3 level.
prg_ng_m_l
- Progesterone level.
rbs_mg_dl
- Random blood sugar level.
weight_gain_y_n
- Whether the patient has experienced weight gain.
hair_growth_y_n
- Whether the patient has experienced excess hair growth.
skin_darkening_y_n
- Whether the patient has experienced skin darkening.
hair_loss_y_n
- Whether the patient has experienced hair loss.
pimples_y_n
- Whether the patient has experienced pimples.
fast_food_y_n
- Whether fast food is a regular part of the patient’s diet.
reg_exercise_y_n
- Whether the patient exercises regularly.
bp_systolic_mm_hg
- Systolic blood pressure, which measures pressure in the arteries when the heart beats.
bp_diastolic_mm_hg
- Diastolic blood pressure, which measures pressure in the arteries between heartbeats.
follicle_no_l
- Number of follicles in the left ovary.
follicle_no_r
- Number of follicles in the right ovary.
avg_f_size_l_mm
- Average follicle size in the left ovary (mm).
avg_f_size_r_mm
- Average follicle size in the right ovary (mm).
endometrium_mm
- Thickness of the endometrium (mm).
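Since several of these predictors are stored as numeric codes, it’s easy to misread them. As a quick illustration, here’s a minimal base R sketch; the lookup values come straight from the blood_group coding above, and the example codes being decoded are made up:

# Lookup table built from the blood_group coding described above
blood_group_labels <- c("11" = "A+", "12" = "A-", "13" = "B+", "14" = "B-",
                        "15" = "O+", "16" = "O-", "17" = "AB+", "18" = "AB-")

# Decode a few hypothetical example codes
blood_group_labels[as.character(c(15, 11, 13))]
#>   15   11   13
#> "O+" "A+" "B+"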
Loading Packages
# Load the packages used throughout this analysis
library(readr)
library(readxl)
library(dplyr)
library(janitor)
library(naniar)
library(corrplot)
library(ggplot2)
library(reshape2)
library(tidymodels)
library(patchwork)
library(discrim)
library(rpart)
library(kernlab)
library(klaR)
library(ranger)
library(vip)
Loading Dataset
# Load the datasets
pcos_without_infertility <- read_excel("data/pcos.xlsx")
pcos_infertility <- read_csv("data/pcos_infertility.csv")

# Merge the two datasets by patient file number
pcos <- left_join(pcos_without_infertility, pcos_infertility, by = "Patient File No.")
Cleaning the Data
Now that we’ve loaded the two datasets and merged them into one called pcos, we can start cleaning the data to prepare it for our predictive modeling.
# Tidy the column names
pcos <- clean_names(pcos)

# View the first 10 rows of the dataset
knitr::kable(head(pcos, 10))
sl_no_x | patient_file_no | pcos_y_n_x | age_yrs | weight_kg | height_cm | bmi | blood_group | pulse_rate_bpm | rr_breaths_min | hb_g_dl | cycle_r_i | cycle_length_days | marraige_status_yrs | pregnant_y_n | no_of_aborptions | i_beta_hcg_m_iu_m_l_x | ii_beta_hcg_m_iu_m_l_x | fsh_m_iu_m_l | lh_m_iu_m_l | fsh_lh | hip_inch | waist_inch | waist_hip_ratio | tsh_m_iu_l | amh_ng_m_l_x | prl_ng_m_l | vit_d3_ng_m_l | prg_ng_m_l | rbs_mg_dl | weight_gain_y_n | hair_growth_y_n | skin_darkening_y_n | hair_loss_y_n | pimples_y_n | fast_food_y_n | reg_exercise_y_n | bp_systolic_mm_hg | bp_diastolic_mm_hg | follicle_no_l | follicle_no_r | avg_f_size_l_mm | avg_f_size_r_mm | endometrium_mm | x45 | sl_no_y | pcos_y_n_y | i_beta_hcg_m_iu_m_l_y | ii_beta_hcg_m_iu_m_l_y | amh_ng_m_l_y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 28 | 44.6 | 152.0 | 19.3 | 15 | 78 | 22 | 10.48 | 2 | 5 | 7 | 0 | 0 | 1.99 | 1.99 | 7.95 | 3.68 | NA | 36 | 30 | NA | 0.68 | 2.0699999999999998 | 45.16 | 17.1 | 0.57 | 92 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 110 | 80 | 3 | 3 | 18 | 18 | 8.5 | NA | NA | NA | NA | NA | NA |
2 | 2 | 0 | 36 | 65.0 | 161.5 | NA | 15 | 74 | 20 | 11.70 | 2 | 5 | 11 | 1 | 0 | 60.80 | 1.99 | 6.73 | 1.09 | NA | 38 | 32 | NA | 3.16 | 1.53 | 20.09 | 61.3 | 0.97 | 92 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 70 | 3 | 5 | 15 | 14 | 3.7 | NA | NA | NA | NA | NA | NA |
3 | 3 | 1 | 33 | 68.8 | 165.0 | NA | 11 | 72 | 18 | 11.80 | 2 | 5 | 10 | 1 | 0 | 494.08 | 494.08 | 5.54 | 0.88 | NA | 40 | 36 | NA | 2.54 | 6.63 | 10.52 | 49.7 | 0.36 | 84 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 120 | 80 | 13 | 15 | 18 | 20 | 10.0 | NA | NA | NA | NA | NA | NA |
4 | 4 | 0 | 37 | 65.0 | 148.0 | NA | 13 | 72 | 20 | 12.00 | 2 | 5 | 4 | 0 | 0 | 1.99 | 1.99 | 8.06 | 2.36 | NA | 42 | 36 | NA | 16.41 | 1.22 | 36.90 | 33.4 | 0.36 | 76 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 70 | 2 | 2 | 15 | 14 | 7.5 | NA | NA | NA | NA | NA | NA |
5 | 5 | 0 | 25 | 52.0 | 161.0 | NA | 11 | 72 | 18 | 10.00 | 2 | 5 | 1 | 1 | 0 | 801.45 | 801.45 | 3.98 | 0.90 | NA | 37 | 30 | NA | 3.57 | 2.2599999999999998 | 30.09 | 43.8 | 0.38 | 84 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 120 | 80 | 3 | 4 | 16 | 14 | 7.0 | NA | NA | NA | NA | NA | NA |
6 | 6 | 0 | 36 | 74.1 | 165.0 | NA | 15 | 78 | 28 | 11.20 | 2 | 5 | 8 | 1 | 0 | 237.97 | 1.99 | 3.24 | 1.07 | NA | 44 | 38 | NA | 1.60 | 6.74 | 16.18 | 52.4 | 0.30 | 76 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 110 | 70 | 9 | 6 | 16 | 20 | 8.0 | NA | NA | NA | NA | NA | NA |
7 | 7 | 0 | 34 | 64.0 | 156.0 | NA | 11 | 72 | 18 | 10.90 | 2 | 5 | 2 | 0 | 0 | 1.99 | 1.99 | 2.85 | 0.31 | NA | 39 | 33 | NA | 1.51 | 3.05 | 26.41 | 42.7 | 0.46 | 93 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 80 | 6 | 6 | 15 | 16 | 6.8 | NA | NA | NA | NA | NA | NA |
8 | 8 | 0 | 33 | 58.5 | 159.0 | NA | 13 | 72 | 20 | 11.00 | 2 | 5 | 13 | 1 | 2 | 100.51 | 100.51 | 4.86 | 3.07 | NA | 44 | 38 | NA | 12.18 | 1.54 | 3.97 | 38.0 | 0.26 | 91 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 80 | 7 | 6 | 15 | 18 | 7.1 | NA | NA | NA | NA | NA | NA |
9 | 9 | 0 | 32 | 40.0 | 158.0 | NA | 11 | 72 | 18 | 11.80 | 2 | 5 | 8 | 0 | 1 | 1.99 | 1.99 | 3.76 | 3.02 | NA | 39 | 35 | NA | 1.51 | 1 | 19.00 | 21.8 | 0.30 | 116 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 80 | 5 | 7 | 17 | 17 | 4.2 | NA | NA | NA | NA | NA | NA |
10 | 10 | 0 | 36 | 52.0 | 150.0 | NA | 15 | 80 | 20 | 10.00 | 4 | 2 | 4 | 0 | 0 | 1.99 | 1.99 | 2.80 | 1.51 | NA | 40 | 38 | NA | 6.65 | 1.61 | 11.74 | 27.7 | 0.25 | 125 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 110 | 80 | 1 | 1 | 14 | 17 | 2.5 | NA | NA | NA | NA | NA | NA |
From the table above, we can see that the values in the columns sl_no_x and patient_file_no are identical, so we’ll remove the sl_no_x column.
pcos <- pcos |>
  dplyr::select(-sl_no_x)
The values in the ii_beta_hcg_m_iu_m_l_x and amh_ng_m_l_x columns are currently stored as characters, so we’ll convert them to numeric since they measure hormone levels. We’ll also convert the columns that represent categories, such as yes/no answers and blood type, from numeric to factors.
pcos <- pcos |>
  mutate(
    pcos_y_n_x = as.factor(pcos_y_n_x),
    ii_beta_hcg_m_iu_m_l_x = as.numeric(ii_beta_hcg_m_iu_m_l_x),
    amh_ng_m_l_x = as.numeric(amh_ng_m_l_x),
    blood_group = as.factor(blood_group),
    cycle_r_i = as.factor(cycle_r_i),
    pregnant_y_n = as.factor(pregnant_y_n),
    weight_gain_y_n = as.factor(weight_gain_y_n),
    hair_growth_y_n = as.factor(hair_growth_y_n),
    skin_darkening_y_n = as.factor(skin_darkening_y_n),
    hair_loss_y_n = as.factor(hair_loss_y_n),
    pimples_y_n = as.factor(pimples_y_n),
    fast_food_y_n = as.factor(fast_food_y_n),
    reg_exercise_y_n = as.factor(reg_exercise_y_n)
  )

knitr::kable(head(pcos, 10))
patient_file_no | pcos_y_n_x | age_yrs | weight_kg | height_cm | bmi | blood_group | pulse_rate_bpm | rr_breaths_min | hb_g_dl | cycle_r_i | cycle_length_days | marraige_status_yrs | pregnant_y_n | no_of_aborptions | i_beta_hcg_m_iu_m_l_x | ii_beta_hcg_m_iu_m_l_x | fsh_m_iu_m_l | lh_m_iu_m_l | fsh_lh | hip_inch | waist_inch | waist_hip_ratio | tsh_m_iu_l | amh_ng_m_l_x | prl_ng_m_l | vit_d3_ng_m_l | prg_ng_m_l | rbs_mg_dl | weight_gain_y_n | hair_growth_y_n | skin_darkening_y_n | hair_loss_y_n | pimples_y_n | fast_food_y_n | reg_exercise_y_n | bp_systolic_mm_hg | bp_diastolic_mm_hg | follicle_no_l | follicle_no_r | avg_f_size_l_mm | avg_f_size_r_mm | endometrium_mm | x45 | sl_no_y | pcos_y_n_y | i_beta_hcg_m_iu_m_l_y | ii_beta_hcg_m_iu_m_l_y | amh_ng_m_l_y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 28 | 44.6 | 152.0 | 19.3 | 15 | 78 | 22 | 10.48 | 2 | 5 | 7 | 0 | 0 | 1.99 | 1.99 | 7.95 | 3.68 | NA | 36 | 30 | NA | 0.68 | 2.07 | 45.16 | 17.1 | 0.57 | 92 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 110 | 80 | 3 | 3 | 18 | 18 | 8.5 | NA | NA | NA | NA | NA | NA |
2 | 0 | 36 | 65.0 | 161.5 | NA | 15 | 74 | 20 | 11.70 | 2 | 5 | 11 | 1 | 0 | 60.80 | 1.99 | 6.73 | 1.09 | NA | 38 | 32 | NA | 3.16 | 1.53 | 20.09 | 61.3 | 0.97 | 92 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 70 | 3 | 5 | 15 | 14 | 3.7 | NA | NA | NA | NA | NA | NA |
3 | 1 | 33 | 68.8 | 165.0 | NA | 11 | 72 | 18 | 11.80 | 2 | 5 | 10 | 1 | 0 | 494.08 | 494.08 | 5.54 | 0.88 | NA | 40 | 36 | NA | 2.54 | 6.63 | 10.52 | 49.7 | 0.36 | 84 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 120 | 80 | 13 | 15 | 18 | 20 | 10.0 | NA | NA | NA | NA | NA | NA |
4 | 0 | 37 | 65.0 | 148.0 | NA | 13 | 72 | 20 | 12.00 | 2 | 5 | 4 | 0 | 0 | 1.99 | 1.99 | 8.06 | 2.36 | NA | 42 | 36 | NA | 16.41 | 1.22 | 36.90 | 33.4 | 0.36 | 76 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 70 | 2 | 2 | 15 | 14 | 7.5 | NA | NA | NA | NA | NA | NA |
5 | 0 | 25 | 52.0 | 161.0 | NA | 11 | 72 | 18 | 10.00 | 2 | 5 | 1 | 1 | 0 | 801.45 | 801.45 | 3.98 | 0.90 | NA | 37 | 30 | NA | 3.57 | 2.26 | 30.09 | 43.8 | 0.38 | 84 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 120 | 80 | 3 | 4 | 16 | 14 | 7.0 | NA | NA | NA | NA | NA | NA |
6 | 0 | 36 | 74.1 | 165.0 | NA | 15 | 78 | 28 | 11.20 | 2 | 5 | 8 | 1 | 0 | 237.97 | 1.99 | 3.24 | 1.07 | NA | 44 | 38 | NA | 1.60 | 6.74 | 16.18 | 52.4 | 0.30 | 76 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 110 | 70 | 9 | 6 | 16 | 20 | 8.0 | NA | NA | NA | NA | NA | NA |
7 | 0 | 34 | 64.0 | 156.0 | NA | 11 | 72 | 18 | 10.90 | 2 | 5 | 2 | 0 | 0 | 1.99 | 1.99 | 2.85 | 0.31 | NA | 39 | 33 | NA | 1.51 | 3.05 | 26.41 | 42.7 | 0.46 | 93 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 80 | 6 | 6 | 15 | 16 | 6.8 | NA | NA | NA | NA | NA | NA |
8 | 0 | 33 | 58.5 | 159.0 | NA | 13 | 72 | 20 | 11.00 | 2 | 5 | 13 | 1 | 2 | 100.51 | 100.51 | 4.86 | 3.07 | NA | 44 | 38 | NA | 12.18 | 1.54 | 3.97 | 38.0 | 0.26 | 91 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 80 | 7 | 6 | 15 | 18 | 7.1 | NA | NA | NA | NA | NA | NA |
9 | 0 | 32 | 40.0 | 158.0 | NA | 11 | 72 | 18 | 11.80 | 2 | 5 | 8 | 0 | 1 | 1.99 | 1.99 | 3.76 | 3.02 | NA | 39 | 35 | NA | 1.51 | 1.00 | 19.00 | 21.8 | 0.30 | 116 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 80 | 5 | 7 | 17 | 17 | 4.2 | NA | NA | NA | NA | NA | NA |
10 | 0 | 36 | 52.0 | 150.0 | NA | 15 | 80 | 20 | 10.00 | 4 | 2 | 4 | 0 | 0 | 1.99 | 1.99 | 2.80 | 1.51 | NA | 40 | 38 | NA | 6.65 | 1.61 | 11.74 | 27.7 | 0.25 | 125 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 110 | 80 | 1 | 1 | 14 | 17 | 2.5 | NA | NA | NA | NA | NA | NA |
Missing Data
# Visualize the missing data
pcos |>
  vis_miss()
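The plot gives a good overview, but we can also quantify the missingness per column. Here’s a quick sketch using miss_var_summary() from naniar (already loaded above) to list the worst offenders:

# Share of missing values per column, most missing first
pcos |>
  miss_var_summary() |>
  head(10)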
Above, we can see that bmi, fsh_lh, waist_hip_ratio, x45, sl_no_y, pcos_y_n_y, i_beta_hcg_m_iu_m_l_y, ii_beta_hcg_m_iu_m_l_y, and amh_ng_m_l_y are all missing at least 50 percent of their values. Since we have each patient’s weight and height, we can recalculate BMI manually. The remaining columns, however, must be removed, since keeping variables with that many missing values would likely cause problems for our models.
# Remove columns with significant missing data and recalculate bmi
pcos <- pcos |>
  dplyr::select(-c(fsh_lh, waist_hip_ratio, x45, sl_no_y, pcos_y_n_y,
                   i_beta_hcg_m_iu_m_l_y, ii_beta_hcg_m_iu_m_l_y, amh_ng_m_l_y)) |>
  mutate(bmi = (weight_kg / height_cm^2) * 10000)
# total number of missing values
sum(is.na(pcos))
[1] 4
Since only 4 values in the data are missing, we’ll simply drop the affected rows. After this final cleaning step, we’re left with 537 observations and 41 columns.
# Remove rows with missing observations
pcos <- pcos |>
  drop_na()
# Print dimensions of datasets
dim(pcos)
[1] 537 41
Visual Data Analysis
The bar plot below shows the distribution of women diagnosed with polycystic ovary syndrome (PCOS) across the 10 hospitals. In the dataset, approximately 175 women have been diagnosed with PCOS (“Yes”), while slightly over 350 women have not (“No”), so the undiagnosed group is roughly twice the size of the diagnosed group.
pcos |>
  ggplot(mapping = aes(x = pcos_y_n_x)) +
  geom_bar(fill = "purple4") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  labs(title = "Distribution of PCOS Diagnosis", x = "PCOS", y = "Frequency")
Correlation Plot
The correlation plot below explores relationships between the numeric variables in the dataset. As expected, there is a strong positive correlation between weight and BMI, since weight is used to calculate BMI. Weight also shows strong correlations with both hip and waist measurements, suggesting that these areas are more affected by weight gain. While there aren’t any strong negative correlations, there is a mild negative relationship between age and the number of follicles in both the left and right ovaries, which could indicate a decline in ovarian activity with age. Overall, most hormone levels and characteristic variables show weak correlations with one another, limiting the connections we can draw between them.
pcos |>
  dplyr::select(where(is.numeric)) |>
  cor() |>
  corrplot(type = "lower", diag = FALSE, method = "color",
           tl.col = "black", tl.srt = 30, order = "AOE")
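To back up these visual impressions with numbers, here’s a small sketch that reuses the same correlation matrix and, with reshape2 (loaded earlier), lists the most strongly correlated variable pairs:

# Melt the correlation matrix and rank pairs by absolute correlation
cor_mat <- pcos |>
  dplyr::select(where(is.numeric)) |>
  cor()
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA  # keep each pair once
reshape2::melt(cor_mat, na.rm = TRUE) |>
  arrange(desc(abs(value))) |>
  head(5)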
PCOS Diagnosis Criteria
Recall that a diagnosis of PCOS requires at least two of the following criteria to be met:
Absent or irregular menstrual cycles
High levels of androgens
Growth of follicles of a specific size in at least one ovary
Although our dataset doesn’t directly measure androgen levels, it does include information on irregular menstrual cycles and follicle sizes. In the next section, we’ll take a closer look at how these factors relate to PCOS in our data.
Menstrual Cycle
pcos |>
  ggplot(aes(x = cycle_r_i, fill = pcos_y_n_x)) +
  geom_bar(position = "fill") +
  scale_x_discrete(labels = c("2" = "Regular", "4" = "Irregular")) +
  scale_fill_manual(values = c("royalblue", "red4"),
                    labels = c("0" = "No", "1" = "Yes")) +
  labs(x = "Cycle", fill = "PCOS Diagnosis")
The bar chart above shows the relationship between irregular periods and a PCOS diagnosis in our dataset. Less than 25% of individuals with regular periods have been diagnosed with PCOS, compared to nearly 65% of those with irregular periods. As expected, this suggests a strong relationship between irregular menstrual cycles and a PCOS diagnosis.
Follicle Count
l_follicle <- pcos |>
  ggplot(aes(x = follicle_no_l, fill = pcos_y_n_x)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("royalblue", "red4"),
                    labels = c("0" = "No", "1" = "Yes")) +
  labs(x = "Follicle Count (Left Ovary)", fill = "PCOS Diagnosis")

r_follicle <- pcos |>
  ggplot(aes(x = follicle_no_r, fill = pcos_y_n_x)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("royalblue", "red4"),
                    labels = c("0" = "No", "1" = "Yes")) +
  labs(x = "Follicle Count (Right Ovary)", fill = "PCOS Diagnosis")

l_follicle / r_follicle
The graphs above highlight the strong relationship between follicle count and a PCOS diagnosis. The two panels show the number of follicles in each ovary, and the share of PCOS cases rises steadily as the follicle count grows. In fact, once the count exceeds 15 follicles in either the left or right ovary, every patient in the dataset was diagnosed with PCOS.
Symptoms and Impact of PCOS
Now that we’ve explored the correlation between follicle size and irregular menstrual cycles, both important for diagnosis, let’s take a look at common symptoms or effects experienced by individuals with PCOS.
BMI
pcos |>
  ggplot(aes(x = bmi, fill = pcos_y_n_x)) +
  geom_boxplot() +
  scale_fill_manual(values = c("royalblue", "red4"),
                    labels = c("0" = "No", "1" = "Yes")) +
  labs(x = "BMI (kg/m²)", fill = "PCOS")
Obesity is commonly linked with PCOS, so the boxplot above shows the BMI distribution for individuals with and without the condition. BMI (body mass index) estimates body fat from weight and height; a healthy value typically falls between 18.5 and 25 kg/m². For example, the first patient in the table weighs 44.6 kg at 152 cm, giving a BMI of 44.6 / 1.52² ≈ 19.3. From the plot, we can see that people with PCOS tend to have a slightly higher BMI on average, but the difference isn’t dramatic. This makes sense, since it’s possible to have PCOS and still be at a normal weight.
Anti-Mullerian Hormone (AMH)
pcos |>
  ggplot(aes(x = amh_ng_m_l_x, fill = pcos_y_n_x)) +
  geom_boxplot() +
  scale_fill_manual(values = c("royalblue", "red4"),
                    labels = c("0" = "No", "1" = "Yes")) +
  labs(x = "AMH (ng/mL)", fill = "PCOS")
Anti-Mullerian Hormone (AMH) is a hormone that helps indicate the number and quality of a woman’s eggs. Individuals with PCOS often have higher AMH levels because they tend to have more follicles in their ovaries. In the plot above, we can see that both the average and overall distribution of AMH levels are higher for those with PCOS compared to those without.
Acne
pcos |>
  ggplot(aes(x = pimples_y_n, fill = pcos_y_n_x)) +
  geom_bar(position = "fill") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  scale_fill_manual(values = c("royalblue", "red4"),
                    labels = c("0" = "No", "1" = "Yes")) +
  labs(x = "Acne", fill = "PCOS Diagnosis")
Women with PCOS tend to have higher levels of androgens, which can lead to hormonal imbalances. One common effect of this imbalance is acne, which can be more extreme and harder to treat. In the barplot above, we see that almost 50% of individuals with PCOS report experiencing acne, compared to less than 25% of those without the condition.
Hair Growth
pcos |>
  ggplot(aes(x = hair_growth_y_n, fill = pcos_y_n_x)) +
  geom_bar(position = "fill") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  scale_fill_manual(values = c("royalblue", "red4"),
                    labels = c("0" = "No", "1" = "Yes")) +
  labs(x = "Hair Growth", fill = "PCOS Diagnosis")
Another effect of hormonal imbalances in PCOS is excessive hair growth. In the barplot above, we can see that around 70% of individuals with PCOS report increased hair growth, compared to roughly 20% of those without the condition. This is a significant difference, suggesting that hair growth is one of the more common symptoms associated with PCOS.
Skin Darkening
pcos |>
  ggplot(aes(x = skin_darkening_y_n, fill = pcos_y_n_x)) +
  geom_bar(position = "fill") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  scale_fill_manual(values = c("royalblue", "red4"),
                    labels = c("0" = "No", "1" = "Yes")) +
  labs(x = "Skin Darkening", fill = "PCOS Diagnosis")
Skin darkening is a known symptom of PCOS and often appears as dark patches on areas like the neck and armpits. This occurs because of elevated levels of androgens and insulin. In our dataset, around 70% of individuals with PCOS reported skin darkening, compared to about 20% of those without the condition. This significant difference suggests skin darkening is another common symptom of PCOS.
Weight Gain
pcos |>
  ggplot(aes(x = weight_gain_y_n, fill = pcos_y_n_x)) +
  geom_bar(position = "fill") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  scale_fill_manual(values = c("royalblue", "red4"),
                    labels = c("0" = "No", "1" = "Yes")) +
  labs(x = "Weight Gain", fill = "PCOS Diagnosis")
As mentioned earlier, obesity is commonly associated with PCOS. The plot above compares reported weight gain for patients with and without a PCOS diagnosis: around 60% of those who reported gaining weight were diagnosed with PCOS, highlighting how the condition can contribute to changes in body weight.
Model Set-Up
Now that we’ve explored how multiple variables relate to a PCOS diagnosis, we can begin building our predictive models. This process includes creating training and testing datasets, setting up a recipe, and applying cross-validation to get an honest estimate of each model’s accuracy.
Train and Test Split
Before modeling, we need to split the dataset into training and testing sets. The training data will be used to fit the models, while the testing data will show how well they perform on new data. To ensure we get the same split each time the code runs, we’ll start by setting a random seed. I’ll set the split proportion to 0.7, so that 70% of the data is used for training; this gives the models enough data to learn from. The split will also be stratified on the pcos_y_n_x variable to keep the same proportion of PCOS diagnoses in both the training and testing sets.
set.seed(38821)

pcos_split <- initial_split(pcos, strata = pcos_y_n_x, prop = 0.70)

pcos_train <- training(pcos_split)
pcos_test <- testing(pcos_split)
Now, let’s confirm that the data was split correctly by checking the proportion of each set. Below, we divide the number of rows in each set by the number of rows in the original cleaned dataset.
Training Data Proportion:
dim(pcos_train)[1]/dim(pcos)[1]
[1] 0.698324
Testing Data Proportion:
dim(pcos_test)[1]/dim(pcos)[1]
[1] 0.301676
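Because the split was stratified, the share of PCOS diagnoses should also be nearly identical in the two sets. A quick check, sketched below (the exact proportions depend on the seed):

# Compare the PCOS class balance in the training and testing sets
pcos_train |> count(pcos_y_n_x) |> mutate(prop = n / sum(n))
pcos_test |> count(pcos_y_n_x) |> mutate(prop = n / sum(n))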
Recipe
Now we can create our recipe using the predictors and response variable. First, we’ll remove a few columns that aren’t helpful for the model:
patient_file_no, since it’s a random ID number
avg_f_size_r_mm and avg_f_size_l_mm, since we’re already using the number of follicles in each ovary
Because the dataset includes several categorical predictors, we’ll use step_dummy() to convert them into indicator variables, and step_zv() to drop any predictors with zero variance. Finally, we’ll center and scale all numeric predictors with step_normalize() so they’re on the same scale.
pcos_recipe <- recipe(pcos_y_n_x ~ ., data = pcos_train) |>
  step_rm(patient_file_no, avg_f_size_r_mm, avg_f_size_l_mm) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
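Before modeling, it can be worth sanity-checking what the recipe actually produces. As a quick sketch (not part of the modeling pipeline itself), we can prep() the recipe and bake() it to preview the transformed training data:

# Apply the recipe steps and preview the processed predictors
pcos_recipe |>
  prep(training = pcos_train) |>
  bake(new_data = NULL) |>
  glimpse()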
K-Fold Cross Validation
To keep the data balanced across each fold, we will use stratified 10-fold cross-validation. Specifically, we will stratify on pcos_y_n_x so that each fold contains a similar proportion of “Yes” and “No” cases.
set.seed(38821)
pcos_fold <- vfold_cv(pcos_train, v = 10, strata = pcos_y_n_x)
Specify Model and Engine
Now that we have created our recipe and our stratified folds, we can begin building our models. In the sections below, we will build and evaluate Linear Discriminant Analysis, Quadratic Discriminant Analysis, Decision Tree, and Support Vector Machine models. To build each model, we will:
Create the model, specifying the mode (classification) and engine.
Create a workflow, adding the model from step 1 and the pcos_recipe we built earlier.
For the Decision Tree and SVM, tune the model across different parameter values using cross-validation and select the best version.
Fit the workflow to the training dataset.
Visualize model performance to see how well the model performed.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) identifies the linear combination of predictors that best separates the classes, in this case whether an individual has PCOS or not. We start by creating the LDA model, then add it and the recipe to a workflow, and finally fit the workflow to the training data. The confusion matrix heatmap shows that the model classified PCOS diagnoses well on the training set, with 13 false positives (incorrectly predicting “Yes”) and 22 false negatives (incorrectly predicting “No”).
lda_mod <- discrim_linear() |>
  set_mode("classification") |>
  set_engine("MASS")

pcos_wkflow_lda <- workflow() |>
  add_model(lda_mod) |>
  add_recipe(pcos_recipe)

pcos_lda_fit <- fit(pcos_wkflow_lda, pcos_train)

pcos_lda_preds <- augment(pcos_lda_fit, new_data = pcos_train) |>
  conf_mat(truth = pcos_y_n_x, estimate = .pred_class) |>
  autoplot(type = "heatmap")

pcos_lda_preds
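Beyond eyeballing the heatmap, yardstick can turn the same confusion matrix into summary statistics. A short sketch pulling out the accuracy, sensitivity, and specificity:

# Numeric summary of the LDA training-set confusion matrix
augment(pcos_lda_fit, new_data = pcos_train) |>
  conf_mat(truth = pcos_y_n_x, estimate = .pred_class) |>
  summary() |>
  dplyr::filter(.metric %in% c("accuracy", "sens", "spec"))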
Quadratic Discriminant Analysis (QDA)
Quadratic Discriminant Analysis (QDA) learns quadratic boundaries between classes and assumes each class has its own covariance structure, making it more flexible than LDA. As with LDA, we create the QDA model, add it and the recipe to a workflow, and fit the workflow to the training data. As expected from its extra flexibility, QDA fit the training data slightly better than LDA, with only 4 false positives and 19 false negatives.
qda_mod <- discrim_quad() |>
  set_mode("classification") |>
  set_engine("MASS")

pcos_wkflow_qda <- workflow() |>
  add_model(qda_mod) |>
  add_recipe(pcos_recipe)

pcos_qda_fit <- fit(pcos_wkflow_qda, pcos_train)

pcos_qda_preds <- augment(pcos_qda_fit, new_data = pcos_train) |>
  conf_mat(truth = pcos_y_n_x, estimate = .pred_class) |>
  autoplot(type = "heatmap")

pcos_qda_preds
Decision Tree
A Decision Tree splits the data into smaller groups based on the values of the predictor variables, choosing cutoffs that best separate the classes, which here is whether someone has PCOS or not. Its final structure looks like a tree, which is where the name comes from. We start by creating the Decision Tree model, then add it and the recipe to a workflow. Next, we tune the cost-complexity parameter to find the best value before fitting the finalized workflow to the training data. The cost-complexity plot shows that the model performs fairly consistently across most penalty values, but performance drops at the largest values, where heavy pruning leaves the tree too simple to capture the signal.
dt_mod <- decision_tree(cost_complexity = tune()) |>
  set_mode("classification") |>
  set_engine("rpart")

pcos_wkflow_dt <- workflow() |>
  add_model(dt_mod) |>
  add_recipe(pcos_recipe)

dt_grid <- grid_regular(cost_complexity(), levels = 10)

dt_tune_res <- tune_grid(
  pcos_wkflow_dt,
  resamples = pcos_fold,
  grid = dt_grid
)

best_dt <- select_best(dt_tune_res, metric = "accuracy")

final_dt_wkflow <- finalize_workflow(pcos_wkflow_dt, best_dt)

pcos_dt_fit <- fit(final_dt_wkflow, pcos_train)

autoplot(dt_tune_res, metric = "roc_auc")
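To see exactly which penalty value won, show_best() from tune lists the top candidates from the grid search. A quick sketch:

# Top cost-complexity values ranked by cross-validated accuracy
show_best(dt_tune_res, metric = "accuracy", n = 3)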
Support Vector Machine (SVM)
A Support Vector Machine (SVM) finds the hyperplane that best separates the classes, positioning it using the support vectors (the data points closest to the boundary). As with the Decision Tree, we create the SVM model, add it and the recipe to a workflow, then tune the cost parameter before fitting the finalized workflow to the training data. The cost plot shows how performance varies across cost values, with the highest point marking the best-performing model.
svm_mod <- svm_rbf(cost = tune()) |>
  set_mode("classification") |>
  set_engine("kernlab")

pcos_wkflow_svm <- workflow() |>
  add_model(svm_mod) |>
  add_recipe(pcos_recipe)

svm_grid <- grid_regular(cost(), levels = 10)

svm_tune_res <- tune_grid(
  pcos_wkflow_svm,
  resamples = pcos_fold,
  grid = svm_grid
)

best_svm <- select_best(svm_tune_res, metric = "accuracy")

final_svm_wkflow <- finalize_workflow(pcos_wkflow_svm, best_svm)

pcos_svm_fit <- fit(final_svm_wkflow, pcos_train)

svm_tune_res |>
  autoplot(metric = "roc_auc")
Model Accuracy
Now that we’ve finished creating the models and fitting them to the training data, we can compare their performance using ROC AUC values. ROC AUC measures how well a model ranks positive cases above negative ones across all classification thresholds: 1.0 is perfect, and 0.5 is no better than random guessing. To compare, we’ll create a tibble that displays each model’s ROC AUC score.
pcos_lda_auc <- augment(pcos_lda_fit, new_data = pcos_train) |>
  roc_auc(truth = pcos_y_n_x, .pred_0) |>
  dplyr::select(.estimate)

pcos_qda_auc <- augment(pcos_qda_fit, new_data = pcos_train) |>
  roc_auc(truth = pcos_y_n_x, .pred_0) |>
  dplyr::select(.estimate)

pcos_dt_auc <- augment(pcos_dt_fit, new_data = pcos_train) |>
  roc_auc(truth = pcos_y_n_x, .pred_0) |>
  dplyr::select(.estimate)

pcos_svm_auc <- augment(pcos_svm_fit, new_data = pcos_train) |>
  roc_auc(truth = pcos_y_n_x, .pred_0) |>
  dplyr::select(.estimate)

pcos_roc_aucs <- c(pcos_lda_auc$.estimate,
                   pcos_qda_auc$.estimate,
                   pcos_dt_auc$.estimate,
                   pcos_svm_auc$.estimate)

pcos_mod_names <- c("LDA",
                    "QDA",
                    "Decision Tree",
                    "Support Vector Machine")

pcos_results <- tibble(Model = pcos_mod_names,
                       roc_auc = pcos_roc_aucs)

pcos_results |>
  arrange(desc(roc_auc))
# A tibble: 4 × 2
Model roc_auc
<chr> <dbl>
1 QDA 0.989
2 Support Vector Machine 0.978
3 LDA 0.971
4 Decision Tree 0.927
Wow! The ROC AUC values were pretty similar across all models, and all were above 0.90. However, QDA had the highest performance with a ROC AUC of 0.9888227. While that is great, it is important to remember that these results are based on the training data, and we still need to see how the models perform on the testing set. Since the ROC AUC scores for QDA and LDA were very close, we will continue working with these two to determine which one truly performs best. First, below is a visual comparison of the ROC AUC performance for the LDA, QDA, Decision Tree, and SVM models.
ggplot(pcos_results,
aes(x = Model, y = roc_auc)) +
geom_bar(stat = "identity", fill = "light blue")
Final ROC AUC Results
Since the Quadratic Discriminant Analysis (QDA) model performed the best on the training dataset, we will apply it to the testing dataset first. Below, we use the same augment() call as before to get the ROC AUC score, but this time we pass the testing data as new_data. This gives us a score of 0.8273325, a significant drop from the training score. While still a solid ROC AUC, the decrease suggests the model may not predict as well on new, unseen data.
pcos_qda_auc_test <- augment(pcos_qda_fit, new_data = pcos_test) |>
  roc_auc(truth = pcos_y_n_x, .pred_0) |>
  dplyr::select(.estimate)

pcos_qda_auc_test
# A tibble: 1 × 1
.estimate
<dbl>
1 0.827
Below is the ROC curve for the QDA model on the testing data. A perfect model would trace a sharp L-shape whose corner touches the point (0, 1), indicating perfect sensitivity and specificity. The QDA curve approaches that shape, suggesting strong performance, though it lacks the sharpness of a perfect classifier.
augment(pcos_qda_fit, new_data = pcos_test) |>
roc_curve(truth = pcos_y_n_x, .pred_0) |>
autoplot()
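For another view of where the test-set errors come from, we can rebuild the confusion matrix on the testing data, mirroring the earlier training-set check. A quick sketch:

# Test-set confusion matrix for the QDA model
augment(pcos_qda_fit, new_data = pcos_test) |>
  conf_mat(truth = pcos_y_n_x, estimate = .pred_class) |>
  autoplot(type = "heatmap")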
Now we’ll calculate the ROC AUC score for the LDA model on the testing dataset, following the same steps as for QDA: augment() with the testing set as new_data. LDA scores 0.9529167. This model held up much better! With this, we can conclude that LDA is our most accurate model for predicting whether an individual has PCOS. While QDA performed best during training, its performance dropped sharply on new data, suggesting that LDA generalizes better to unseen observations.
pcos_lda_auc_test <- augment(pcos_lda_fit, new_data = pcos_test) |>
  roc_auc(truth = pcos_y_n_x, .pred_0) |>
  dplyr::select(.estimate)

pcos_roc_aucs_test <- c(pcos_lda_auc_test$.estimate)

pcos_mod_names_test <- c("LDA")

pcos_test_res <- tibble(Model = pcos_mod_names_test,
                        roc_auc = pcos_roc_aucs_test)

pcos_test_res |>
  arrange(desc(roc_auc))
# A tibble: 1 × 2
Model roc_auc
<chr> <dbl>
1 LDA 0.953
Below is the LDA ROC AUC curve. Again, a perfect model would have a sharp L-shape, which we can see is closely replicated in the LDA model.
augment(pcos_lda_fit, new_data = pcos_test) |>
roc_curve(truth = pcos_y_n_x, .pred_0) |>
autoplot() +
ggtitle("LDA Model")
Conclusion
After learning about PCOS, analyzing patient data from 10 hospitals in India, and building several models to predict who might have the condition, we found that the Linear Discriminant Analysis (LDA) model performed best. It achieved a test-set ROC AUC of about 0.95, outperforming the Quadratic Discriminant Analysis, Decision Tree, and Support Vector Machine models. Using the variables that remained after cleaning, LDA was the most effective at making predictions. However, there’s still room to improve the model’s performance in the future.
While QDA performed well during training, its drop in accuracy during testing suggests it may have overfit the training data. To prevent this in the future, carefully selecting variables that are not strongly correlated and reducing the overall number of features used could help. Additionally, trying more advanced models like Random Forests may be beneficial, as they are better at identifying nonlinear relationships and tend to perform well with large numbers of variables.
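As a starting point for that future work, here’s a minimal, untested sketch of how a random forest could slot into the existing tidymodels pipeline. It reuses pcos_recipe and pcos_fold from above along with the ranger and vip packages loaded at the top; the mtry and min_n ranges are illustrative assumptions, not tuned choices:

# Random forest specification with tunable hyperparameters
rf_mod <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) |>
  set_mode("classification") |>
  set_engine("ranger", importance = "impurity")

pcos_wkflow_rf <- workflow() |>
  add_model(rf_mod) |>
  add_recipe(pcos_recipe)

# Illustrative grid; the ranges are assumptions
rf_grid <- grid_regular(mtry(range = c(5, 20)),
                        min_n(range = c(2, 20)),
                        levels = 5)

rf_tune_res <- tune_grid(pcos_wkflow_rf, resamples = pcos_fold, grid = rf_grid)

best_rf <- select_best(rf_tune_res, metric = "roc_auc")
pcos_rf_fit <- fit(finalize_workflow(pcos_wkflow_rf, best_rf), pcos_train)

# Plot the ten most important predictors
pcos_rf_fit |>
  extract_fit_parsnip() |>
  vip(num_features = 10)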
Also, expanding the dataset to include a larger and more diverse population, along with adding variables like genetics and family history, could make the model more accurate and the predictions more personalized. This would allow the model to apply to a broader group while improving accuracy, helping individuals better understand their own risk of PCOS. Altogether, these improvements could lead to more reliable predictions, earlier detection, and better support for those dealing with or at risk for PCOS.
Working on this project helped me truly understand the medical factors involved in identifying polycystic ovary syndrome (PCOS). I’ve personally experienced several of these symptoms, but they were often overlooked or dismissed when I spoke up about them. This process really opened my eyes to how important it is to raise awareness about conditions like PCOS and to speak out for those who might not know how.
I’m really proud of how well the model performed. Knowing that my very first machine learning project can help identify PCOS feels like a huge accomplishment. It motivates me to keep learning and improving so I can develop tools that genuinely make a difference in people’s health and lives.
Sources
Christ, Jacob P., and Marcelle I. Cedars. “Current guidelines for diagnosing PCOS.” Diagnostics 13.6 (2023): 1113.
Coffey, S., and H. Mason. “The effect of polycystic ovary syndrome on health-related quality of life.” Gynecological Endocrinology 17.5 (2003): 379-386.
Johns Hopkins Medicine. “Polycystic Ovary Syndrome (PCOS).” Johns Hopkins Medicine, www.hopkinsmedicine.org/health/conditions-and-diseases/polycystic-ovary-syndrome-pcos.
Mayo Clinic Staff. “Polycystic Ovary Syndrome (PCOS) - Symptoms and Causes.” Mayo Clinic, 8 Sept. 2022, www.mayoclinic.org/diseases-conditions/pcos/symptoms-causes/syc-20353439.
Mount Sinai. “Hemoglobin Information | Mount Sinai - New York.” Mount Sinai Health System, 2024, www.mountsinai.org/health-library/tests/hemoglobin.
Liu, Kristina, MD, MHS, and Janelle Nassim, MD. “Polycystic Ovarian Syndrome and the Skin.” Harvard Health, 29 Apr. 2021, www.health.harvard.edu/blog/polycystic-ovarian-syndrome-and-the-skin-202104292552.