Statistical Analysis of Polycystic Ovary Syndrome (PCOS) Indicators

Exploring Hormonal and Lifestyle Factors Associated with PCOS

Prerana Ramchandra: s4058630 and Chandangowda Maruvanahalli Shivaramu: s4063920

Last updated: 18 October, 2024

Introduction

Problem Statement

Data introduction and sourcing

Importing Data

pcos <- read.csv("D:/app_ana/pcos.csv")

head(pcos)

Data Description

Important Variables:

  1. Age (yrs): Age of the patient (Numeric)
  2. Weight (Kg): Weight in kilograms (Numeric)
  3. Height (Cm): Height in centimeters (Numeric)
  4. BMI: Body Mass Index, calculated using weight and height (Numeric)
  5. PCOS (Y/N): Indicates whether the individual has PCOS (1 for Yes, 0 for No) (Categorical)
  6. FSH(mIU/mL): Follicle Stimulating Hormone levels (Numeric)
  7. LH(mIU/mL): Luteinizing Hormone levels (Numeric)
  8. TSH (mIU/L): Thyroid-Stimulating Hormone levels (Numeric)

Additional Variables: The dataset also includes Random Blood Sugar (RBS), Waist-Hip Ratio, endometrium thickness, marriage status (years), pregnancy status (Y/N), number of abortions, blood group, and lifestyle indicators such as weight gain, hair growth, skin darkening, pimples, fast food consumption, and regular exercise.

Data Preprocessing

Inspecting the dataset by checking the data type of each column

data_types <- sapply(pcos, class)
data_type_count <- table(data_types)
# Print the count of each data type
print(data_type_count)
## data_types
## character   integer   numeric 
##         3        24        18
pcos$PCOS..Y.N. <- as.factor(pcos$PCOS..Y.N.)
pcos$Cycle.R.I. <- as.factor(pcos$Cycle.R.I.)
pcos$Pregnant.Y.N. <- as.factor(pcos$Pregnant.Y.N.)
pcos$Weight.gain.Y.N. <- as.factor(pcos$Weight.gain.Y.N.)
pcos$hair.growth.Y.N. <- as.factor(pcos$hair.growth.Y.N.)
pcos$Skin.darkening..Y.N. <- as.factor(pcos$Skin.darkening..Y.N.)
pcos$Hair.loss.Y.N. <- as.factor(pcos$Hair.loss.Y.N.)
pcos$Pimples.Y.N. <- as.factor(pcos$Pimples.Y.N.)
pcos$Fast.food..Y.N. <- as.factor(pcos$Fast.food..Y.N.)
pcos$Reg.Exercise.Y.N. <- as.factor(pcos$Reg.Exercise.Y.N.)

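In the Y/N variables, “1” signifies “Yes” and “0” signifies “No”; converting them to factors ensures they are treated as categories rather than numbers.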
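The same conversion can be written more compactly by listing the columns once. The following is a minimal sketch on a small made-up data frame: `toy` and its values are hypothetical, and only the column names mirror two of those converted above.

```r
# Hypothetical toy data standing in for pcos
toy <- data.frame(
  PCOS..Y.N.       = c(1, 0, 0, 1),
  Weight.gain.Y.N. = c(1, 0, 1, 1),
  BMI              = c(27.1, 22.4, 24.0, 30.2)
)

# Columns coded 0/1 that should be treated as categorical
yn_cols <- c("PCOS..Y.N.", "Weight.gain.Y.N.")

# Convert all listed columns to factors in one step
toy[yn_cols] <- lapply(toy[yn_cols], as.factor)

str(toy)
```

Listing the column names in one vector keeps the set of categorical variables easy to review and extend.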

Data Preprocessing Cont.

Columns containing missing values:

## [1] "Marraige.Status..Yrs." "Fast.food..Y.N."
# Helper returning the most frequent non-missing value (mode) of a vector
get_mode <- function(v) {
  uniqv <- unique(v[!is.na(v)])
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Replace missing values in Marraige.Status..Yrs. with mean based on PCOS..Y.N.
pcos$Marraige.Status..Yrs.[is.na(pcos$Marraige.Status..Yrs.) & pcos$PCOS..Y.N. == 1] <- 
  mean(pcos$Marraige.Status..Yrs.[pcos$PCOS..Y.N. == 1], na.rm = TRUE)

pcos$Marraige.Status..Yrs.[is.na(pcos$Marraige.Status..Yrs.) & pcos$PCOS..Y.N. == 0] <- 
  mean(pcos$Marraige.Status..Yrs.[pcos$PCOS..Y.N. == 0], na.rm = TRUE)

# Replace missing values in Fast.food..Y.N. with mode based on PCOS..Y.N.
pcos$Fast.food..Y.N.[is.na(pcos$Fast.food..Y.N.) & pcos$PCOS..Y.N. == 1] <- 
  get_mode(pcos$Fast.food..Y.N.[pcos$PCOS..Y.N. == 1])

pcos$Fast.food..Y.N.[is.na(pcos$Fast.food..Y.N.) & pcos$PCOS..Y.N. == 0] <- 
  get_mode(pcos$Fast.food..Y.N.[pcos$PCOS..Y.N. == 0])
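The code that produced the list of columns with missing values is not shown. One way to obtain such a list, and to apply the same group-wise mean imputation in a single step, is sketched below on a made-up data frame. The names `toy`, `group`, `years`, and `bmi` are hypothetical and the values are invented for illustration.

```r
# Hypothetical toy data: one numeric column with gaps, one grouping column
toy <- data.frame(
  group = factor(c(0, 0, 0, 1, 1, 1)),
  years = c(4, NA, 8, 10, 12, NA),
  bmi   = c(22, 24, 23, 27, 29, 28)
)

# Names of the columns that contain at least one missing value
na_cols <- names(toy)[colSums(is.na(toy)) > 0]
print(na_cols)

# Mean of the observed values within each group, repeated for every row
group_means <- ave(toy$years, toy$group,
                   FUN = function(x) mean(x, na.rm = TRUE))

# Fill each gap with the mean of its own group
toy$years <- ifelse(is.na(toy$years), group_means, toy$years)
print(toy$years)
```

Using `ave()` avoids repeating the assignment once per group, which matters more when the grouping variable has many levels.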

Identifying outliers: Box plots revealed several outliers, which were retained because they may carry important information. The one exception was LH, where a single extreme value of around 2000 mIU/mL was judged to be an anomaly and removed from the dataset. The box plot for LH is shown below:

Remove the extreme outlier in LH levels.

pcos <- pcos[pcos$LH.mIU.mL. < 2000, ]
print("The outlier in LH (mIU/mL) has been removed")
## [1] "The outlier in LH (mIU/mL) has been removed"
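As a rough illustration of how a box plot flags such a value, the sketch below applies the same 1.5 × IQR rule that `boxplot()` uses to a short made-up vector. The readings in `lh` are hypothetical and are not taken from the dataset. The filter also guards against missing values, which a plain `x < 2000` comparison would otherwise turn into all-`NA` rows when subsetting a data frame.

```r
# Hypothetical LH readings with one implausible value
lh <- c(1.2, 2.3, 2.5, 3.1, 4.0, 2018)

# Values lying more than 1.5 * IQR beyond the hinges, as drawn by boxplot()
flagged <- boxplot.stats(lh)$out
print(flagged)

# NA-safe filter: keep readings below the cut-off, drop missing ones
lh_clean <- lh[!is.na(lh) & lh < 2000]
print(lh_clean)
```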

Descriptive Statistics and Visualisation

PCOS is a common endocrine disorder affecting women, characterized by a range of symptoms that can impact both physical appearance and internal hormone regulation. To better understand this condition, we examined a dataset with several physical and physiological variables.

Descriptive Statistics Cont.

The histogram below visualizes the BMI distribution among individuals with and without PCOS. Key observations are:

Descriptive Statistics Cont.

The bar chart below visualizes the distribution of individuals who reported weight gain, segmented by their PCOS status. Here are the insights based on the graph:

Descriptive Statistics Cont.

Analysis of Key Variables in PCOS Dataset

For the following analysis, we have chosen to focus on key variables such as BMI, Age, and LH levels to provide a clearer understanding of the dataset. While the dataset contains a wide range of health indicators, these variables were selected due to their relevance in assessing the characteristics associated with PCOS. The summary statistics reveal notable differences between individuals with and without PCOS:

Variable      PCOS..Y.N.  Min       Q1        Median    Q3        Max       Mean      SD        Missing
BMI           0           13.38797  21.35897  23.60000  26.09093  38.26531  23.74740  3.759378  0
BMI           1           12.41788  23.00473  25.10194  28.30096  38.90000  25.48439  4.404994  0
Age (yrs)     0           20        28        32        36        48        32.06593  5.360918  0
Age (yrs)     1           21        27        29        33        47        30.11364  5.305376  0
LH (mIU/mL)   0           0.020     1.03      2.305     3.6025    14.69     2.612676  2.103597  0
LH (mIU/mL)   1           0.032     1.00      2.205     4.3000    14.24     3.018250  2.666775  0
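The code behind this summary is not shown. A minimal base R sketch of how per-group statistics of this kind could be produced is given below. It uses a made-up data frame: `toy`, its values, and the helper `summarise_var` are hypothetical, and this is not necessarily how the table above was generated.

```r
# Hypothetical toy data standing in for pcos
toy <- data.frame(
  PCOS = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  BMI  = c(21.0, 23.5, 24.1, 26.0, 23.0, 25.2, 28.4, 30.1)
)

# Summary statistics for one numeric vector
summarise_var <- function(x) {
  c(Min     = min(x, na.rm = TRUE),
    Q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
    Median  = median(x, na.rm = TRUE),
    Q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
    Max     = max(x, na.rm = TRUE),
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    Missing = sum(is.na(x)))
}

# One row of statistics per PCOS group
bmi_summary <- do.call(rbind, lapply(split(toy$BMI, toy$PCOS), summarise_var))
print(round(bmi_summary, 2))
```

Calling the same helper on the Age and LH columns would give the remaining rows of a table laid out like the one above.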

Hypothesis Testing

To investigate differences between individuals with and without PCOS, we run two tests: a comparison of mean Luteinizing Hormone (LH) levels, to understand hormonal differences, and a test of association between weight gain and PCOS status.

1. Hypothesis Test for LH Levels

Hypotheses:

\[ \begin{aligned} H_0 &: \mu_{\text{PCOS}} = \mu_{\text{Non-PCOS}} \\ H_A &: \mu_{\text{PCOS}} \neq \mu_{\text{Non-PCOS}} \end{aligned} \]

Assumptions

  1. Independence: The observations in each group are assumed to be independent of each other. This means that the data collected from one group should not influence the data collected from the other group.

  2. Normality: It is assumed that the distribution of the dependent variable (LH levels) is approximately normal within each group. Given that the sample sizes are much greater than \(n = 30\), the Central Limit Theorem suggests that the sampling distribution of the means will be approximately normal, even if the original data is not perfectly normal.

  3. Equal Variance: We used Levene’s test to assess homogeneity of variances, that is, whether the variances of LH levels in the two groups differ significantly.

R Code for Levene’s test

# leveneTest() is provided by the car package
library(car)
levene_result <- leveneTest(LH.mIU.mL. ~ PCOS..Y.N., data = pcos)

# Print the result
print(levene_result)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  5.6915 0.01739 *
##       538                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Levene’s Test for Homogeneity of Variance yielded a p-value of 0.01739, which is less than the significance level of 0.05. This indicates a significant difference in variances between the LH levels of individuals with and without PCOS. Therefore, we reject the null hypothesis of equal variances. This suggests that the assumption of equal variances for a standard two-sample t-test is violated, and Welch’s t-test should be used instead.
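To make the test less of a black box: Levene’s test with `center = median` amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that calculation in base R on simulated data. The data frame `toy` and its values are hypothetical and generated only for illustration.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical data: two groups with different spread
toy <- data.frame(
  group = factor(rep(c(0, 1), each = 50)),
  value = c(rnorm(50, mean = 2.5, sd = 1), rnorm(50, mean = 3, sd = 2))
)

# Absolute deviation of each value from its own group median
toy$abs_dev <- abs(toy$value - ave(toy$value, toy$group, FUN = median))

# One-way ANOVA on the deviations gives the median-centred Levene F test
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
print(levene_by_hand)
```

A large F value here means the typical distance from the median differs between groups, which is what unequal variances look like.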

R Code for Welch’s t-test

# Perform Welch's t-test to compare LH levels
welch_t_test_lh <- t.test(LH.mIU.mL. ~ PCOS..Y.N., data = pcos, conf.level = 0.95)
print(welch_t_test_lh)
## 
##  Welch Two Sample t-test
## 
## data:  LH.mIU.mL. by PCOS..Y.N.
## t = -1.769, df = 283.76, p-value = 0.07797
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.8568579  0.0457095
## sample estimates:
## mean in group 0 mean in group 1 
##        2.612676        3.018250

Inference from Welch’s t-test

The Welch’s t-test results indicate that there is not enough evidence to conclude that there is a significant difference in the mean LH levels between individuals with PCOS (mean ≈ 3.018) and those without PCOS (mean ≈ 2.613). The t-value is approximately -1.769 with 283.76 degrees of freedom, resulting in a p-value of 0.078, which is greater than the common significance level of 0.05.

Furthermore, the 95% confidence interval for the difference in means (group 0 minus group 1) ranges from approximately -0.857 to 0.046. Since this interval includes zero, it is consistent with the non-significant test result. Therefore, we fail to reject the null hypothesis: the data do not provide sufficient evidence that mean LH levels differ between the two groups. This does not show that the means are equal, only that a difference could not be detected at the 5% level.
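The Welch statistic can also be recomputed from summary statistics alone, which makes the formula explicit. The sketch below uses the group means and standard deviations of LH reported in the descriptive statistics. The group sizes are an assumption: they are taken from the column totals of the weight gain by PCOS contingency table in this report (281 + 83 and 55 + 121), on the understanding that it was built from the same cleaned data.

```r
# Reported summary statistics (group 0 = no PCOS, group 1 = PCOS)
mean0 <- 2.612676; sd0 <- 2.103597
mean1 <- 3.018250; sd1 <- 2.666775

# Assumption: group sizes from the contingency table column totals
n0 <- 364; n1 <- 176

# Standard error of the difference in means (variances not pooled)
v0 <- sd0^2 / n0
v1 <- sd1^2 / n1
se <- sqrt(v0 + v1)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean0 - mean1) / se
df <- (v0 + v1)^2 / (v0^2 / (n0 - 1) + v1^2 / (n1 - 1))

# Two-sided p-value and 95% confidence interval
p_value <- 2 * pt(-abs(t_stat), df)
ci <- (mean0 - mean1) + c(-1, 1) * qt(0.975, df) * se

print(round(c(t = t_stat, df = df, p = p_value), 4))
print(round(ci, 4))
```

Under these assumptions the values should agree with the `t.test()` output up to rounding of the inputs. The non-integer degrees of freedom come from the Welch-Satterthwaite approximation.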

2. Categorical Association

The Chi-squared test of independence is used here to determine whether there is a significant association between weight gain and PCOS status among individuals in the dataset. Because the test evaluates association rather than causation, it shows whether weight gain is reported more often in one group than the other, not whether PCOS causes weight gain.

Hypothesis:

\[ \begin{aligned} H_0 &: \text{Weight gain and PCOS are independent} \\ H_A &: \text{Weight gain and PCOS are not independent} \end{aligned} \]

Assumptions for Chi-squared Test

  1. Independence of Observations: The observations must be independent, meaning the outcome for one individual should not affect that of another. Here each row of the dataset is assumed to describe a different individual, so each person contributes to exactly one cell of the contingency table.

  2. Categorical Variables: Both variables being analyzed should be categorical. In this analysis, the variables involved are:

    Weight Gain: Categorical (Yes/No)
    PCOS Status: Categorical (Yes/No)

    Since both variables are categorical, this assumption is satisfied.

  3. Sample Size: Each expected frequency in the contingency table should ideally be at least 5 for the test results to be reliable. The expected frequencies can be computed based on the contingency table. If all expected frequencies are greater than or equal to 5, this assumption is satisfied. If not, it may be necessary to consider merging categories or using a different statistical test.

R Code for Chi-squared Test:

# Load required packages
library(dplyr)

# Create a contingency table for weight gain and PCOS status
contingency_table <- table(pcos$Weight.gain.Y.N., pcos$PCOS..Y.N.)

# Print the contingency table
print(contingency_table)
##    
##       0   1
##   0 281  55
##   1  83 121
# Perform the Chi-squared test
chi_squared_test <- chisq.test(contingency_table)
print(chi_squared_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 104.61, df = 1, p-value < 2.2e-16

Chi-Squared Test Results Summary

The Chi-squared statistic is 104.61 with 1 degree of freedom, and the p-value is below 2.2e-16. Since the p-value is far below the conventional alpha level of 0.05, we reject the null hypothesis of independence: there is a statistically significant association between weight gain and PCOS status. In this dataset, the proportion of individuals reporting weight gain therefore differs between those with and without PCOS.
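The sample-size assumption listed for this test can be checked directly from the observed counts. The sketch below rebuilds the contingency table from the printed output, so it runs without the original data. The object name `observed` and the dimension labels are chosen here for illustration.

```r
# Observed counts copied from the printed contingency table
# rows: weight gain (0 = No, 1 = Yes); columns: PCOS (0 = No, 1 = Yes)
observed <- matrix(c(281, 83, 55, 121), nrow = 2,
                   dimnames = list(weight_gain = c("0", "1"),
                                   pcos = c("0", "1")))

# Re-run the test on the counts alone
res <- chisq.test(observed)

# Expected counts under independence; all should be at least 5
print(round(res$expected, 1))

# Share of each PCOS group reporting weight gain (column proportions)
print(round(prop.table(observed, margin = 2), 3))
```

The smallest expected count is about 66.5, well above 5, so the assumption holds. The column proportions also give the association a direction: 121 of 176 individuals with PCOS (about 69%) reported weight gain, against 83 of 364 without PCOS (about 23%).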

Discussion

Major Findings:

  1. Hormonal Imbalances: Although LH levels varied more widely in the PCOS group, Welch’s t-test did not find a statistically significant difference in mean LH levels between the two groups. This indicates that hormonal irregularities may present differently across individuals.
  2. Weight Gain Association: A chi-square test confirmed a significant association between weight gain and PCOS, emphasizing the link between the condition and weight management challenges.

Strengths:

  1. Comprehensive Approach: Our use of statistical techniques, from descriptive analysis to hypothesis testing, enabled a thorough examination of the data.
  2. Focus on Key Indicators: Concentrating on BMI, LH levels, and weight gain provided clear insights into the characteristics of PCOS.

Limitations:

  1. Data Generalizability: The dataset may not represent all populations, limiting the generalizability of the findings.
  2. Potential Bias: Clinical data might have selection bias, over-representing those seeking treatment.

Future Directions:

  1. Expand Variables: Including more health indicators, such as insulin levels, could provide a deeper understanding of PCOS.
  2. Predictive Models: Machine learning models could be explored to predict PCOS risk, aiding in earlier diagnosis and treatment.

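Conclusion: Our findings show a statistically significant association between PCOS and weight gain. The PCOS group also had a higher average BMI in the descriptive statistics (mean 25.5 versus 23.7), although this difference was not formally tested, and mean LH levels did not differ significantly between groups. These results highlight the need for targeted weight management in PCOS treatment plans. A better understanding of these key indicators can improve diagnosis and care strategies, leading to better patient outcomes.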

References

[1] “Polycystic ovary syndrome (PCOS),” www.kaggle.com. https://www.kaggle.com/datasets/prasoonkottarathil/polycystic-ovary-syndrome-pcos

[2] “The Link Between PCOS and Diabetes,” www.healthcentral.com, May 21, 2020. https://www.healthcentral.com/condition/polycystic-ovary-syndrome-pcos/link-between-pcos-diabetes

[3] Y. Xie, J. J. Allaire, and G. Grolemund, “4.2 Slidy presentation,” in R Markdown: The Definitive Guide. Available: https://bookdown.org/yihui/rmarkdown/slidy-presentation.html

[4] RMIT 2024: Applied Analytics Course Materials

[5] “GraphPad Prism 9 Statistics Guide - The unequal variance Welch t test,” www.graphpad.com. https://www.graphpad.com/guides/prism/latest/statistics/stat_the_unequal_variance_welch_t_t.htm