Problem Statement

Question: What are the significant factors associated with PCOS, and how do health indicators like BMI, insulin levels, and hormone levels vary between women with and without PCOS?
The analysis uses statistical techniques, including hypothesis testing and regression analysis, to identify key relationships.

Data introduction and sourcing

Source: Open dataset on PCOS, collected from Kaggle.
The dataset includes various health indicators, such as BMI, insulin levels, and hormonal levels.
Sampling Method: The data was collected through clinical observations.

Importing Data

Importing data into the variable ‘pcos’.

pcos <- read.csv("D:/app_ana/pcos.csv")

head(pcos)

The data contains 46 columns with 43 variables which are indicators of PCOS.

Data Description

Important Variables:

Age (yrs): Age of the patient (Numeric)
Weight (Kg): Weight in kilograms (Numeric)
Height (Cm): Height in centimeters (Numeric)
BMI: Body Mass Index, calculated using weight and height (Numeric)
PCOS (Y/N): Indicates whether the individual has PCOS (1 for Yes, 0 for No) (Categorical)
FSH(mIU/mL): Follicle Stimulating Hormone levels (Numeric)
LH(mIU/mL): Luteinizing Hormone levels (Numeric)
TSH (mIU/L): Thyroid-Stimulating Hormone levels (Numeric)

Additional Variables: Additionally, there are other variables like Random Blood Sugar (RBS), Waist-Hip Ratio, Endometrium thickness, Marriage status (years), Pregnancy status (Y/N), Number of abortions, Blood Group, and various lifestyle indicators such as weight gain, hair growth, skin darkening, pimples, fast food consumption, and regular exercise.

Data Preprocessing

Inspecting the dataset by checking the data type of each column

data_types <- sapply(pcos, class)
data_type_count <- table(data_types)
# Print the count of each data type
print(data_type_count)

## data_types
## character   integer   numeric 
##         3        24        18

Several categorical columns that should be stored as factors are stored as integers, so they are converted to factors with levels “0” and “1”.

pcos$PCOS..Y.N. <- as.factor(pcos$PCOS..Y.N.)
pcos$Cycle.R.I. <- as.factor(pcos$Cycle.R.I.)
pcos$Pregnant.Y.N. <- as.factor(pcos$Pregnant.Y.N.)
pcos$Weight.gain.Y.N. <- as.factor(pcos$Weight.gain.Y.N.)
pcos$hair.growth.Y.N. <- as.factor(pcos$hair.growth.Y.N.)
pcos$Skin.darkening..Y.N. <- as.factor(pcos$Skin.darkening..Y.N.)
pcos$Hair.loss.Y.N. <- as.factor(pcos$Hair.loss.Y.N.)
pcos$Pimples.Y.N. <- as.factor(pcos$Pimples.Y.N.)
pcos$Fast.food..Y.N. <- as.factor(pcos$Fast.food..Y.N.)
pcos$Reg.Exercise.Y.N. <- as.factor(pcos$Reg.Exercise.Y.N.)

“1” signifying “Yes” and “0” signifying “No”
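The same conversion can be written as one loop over the column names instead of one line per column. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and stands in for `pcos`); only the column names are taken from the dataset.

```r
# Hypothetical toy data standing in for the real pcos data frame
toy <- data.frame(
  PCOS..Y.N.       = c(0L, 1L, 0L, 1L),
  Weight.gain.Y.N. = c(0L, 1L, 1L, 1L),
  BMI              = c(21.4, 27.9, 23.1, 30.2)
)

# Columns that hold 0/1 indicators and should become factors
binary_cols <- c("PCOS..Y.N.", "Weight.gain.Y.N.")

# Convert every listed column, fixing the levels to "0" and "1"
toy[binary_cols] <- lapply(toy[binary_cols], factor, levels = c(0, 1))

str(toy)
```

Fixing the levels explicitly keeps both “0” and “1” as levels even if a column happens to contain only one of the two values.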

Data Preprocessing:

On investigation it was found that there were 2 columns with missing values.

## [1] "Marraige.Status..Yrs." "Fast.food..Y.N."
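The code that produced this list is not shown. One way to obtain such a list in base R is to count the `NA` entries per column, sketched here on a hypothetical toy data frame (`toy`, `na_counts` and `cols_with_na` are illustrative names, not part of the original analysis).

```r
# Hypothetical toy data; NA marks a missing entry
toy <- data.frame(
  Marraige.Status..Yrs. = c(7, NA, 10, 4),
  Fast.food..Y.N.       = c(1, 0, NA, 1),
  BMI                   = c(21.4, 27.9, 23.1, 30.2)
)

# Count missing values per column, then keep the names with at least one
na_counts <- colSums(is.na(toy))
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```

Note that `is.na()` only detects values that were read in as `NA`; empty strings in character columns would need to be declared through the `na.strings` argument of `read.csv()`.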

Missing values in the Marraige.Status..Yrs. column (spelled as in the dataset) are replaced with the mean, calculated separately for each category of PCOS..Y.N. Missing values in the Fast.food..Y.N. column are filled with the mode, also determined within the corresponding PCOS..Y.N. category.

# Helper that returns the most frequent non-missing value (the mode)
get_mode <- function(v) {
  v <- v[!is.na(v)]  # ignore missing values so NA can never be the mode
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Replace missing values in Marraige.Status..Yrs. with mean based on PCOS..Y.N.
pcos$Marraige.Status..Yrs.[is.na(pcos$Marraige.Status..Yrs.) & pcos$PCOS..Y.N. == 1] <- 
  mean(pcos$Marraige.Status..Yrs.[pcos$PCOS..Y.N. == 1], na.rm = TRUE)

pcos$Marraige.Status..Yrs.[is.na(pcos$Marraige.Status..Yrs.) & pcos$PCOS..Y.N. == 0] <- 
  mean(pcos$Marraige.Status..Yrs.[pcos$PCOS..Y.N. == 0], na.rm = TRUE)

# Replace missing values in Fast.food..Y.N. with mode based on PCOS..Y.N.
pcos$Fast.food..Y.N.[is.na(pcos$Fast.food..Y.N.) & pcos$PCOS..Y.N. == 1] <- 
  get_mode(pcos$Fast.food..Y.N.[pcos$PCOS..Y.N. == 1])

pcos$Fast.food..Y.N.[is.na(pcos$Fast.food..Y.N.) & pcos$PCOS..Y.N. == 0] <- 
  get_mode(pcos$Fast.food..Y.N.[pcos$PCOS..Y.N. == 0])
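The group-wise mean imputation can also be expressed with `ave()`, which applies a function within each group and returns a vector in the original row order. This is a sketch on hypothetical toy data, not the code used for the analysis; `toy` and `impute_mean` are illustrative names.

```r
# Hypothetical toy data: two groups, one missing value in each
toy <- data.frame(
  PCOS..Y.N.            = c(0, 0, 0, 1, 1, 1),
  Marraige.Status..Yrs. = c(6, NA, 10, 3, 5, NA)
)

# Replace NA with the mean of the non-missing values in the same group
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() calls impute_mean() once per PCOS group
toy$Marraige.Status..Yrs. <- ave(
  toy$Marraige.Status..Yrs.,
  toy$PCOS..Y.N.,
  FUN = impute_mean
)
print(toy)
```

With this approach the number of groups does not need to be hard-coded, so the same call would still work if the grouping variable had more than two categories.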

Identifying outliers - Upon examining the box plots, several outliers were identified but retained for their critical information. An exception was made for the LH hormone markers, where an extreme value of around 2000 mIU/mL was deemed an anomaly and removed from the dataset. The box plot for the LH hormone markers is shown below:

Remove the extreme outlier in LH levels.

pcos <- pcos[pcos$LH.mIU.mL. < 2000, ]
print("The outlier in LH (mIU/mL) has been removed")

## [1] "The outlier in LH (mIU/mL) has been removed"
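Box plots flag points that lie more than 1.5 times the interquartile range beyond the hinges. The sketch below applies the same rule numerically with `boxplot.stats()` to a short vector of made-up LH values; the numbers, including the extreme one, are hypothetical and only illustrate the mechanics.

```r
# Hypothetical LH values (mIU/mL) with one extreme entry
lh <- c(1.0, 2.3, 3.6, 2.2, 4.3, 0.9, 2018.0, 3.1)

# boxplot.stats() returns the points beyond 1.5 * IQR from the hinges
flagged <- boxplot.stats(lh)$out
print(flagged)

# Keep only values below the threshold used in this analysis
lh_clean <- lh[lh < 2000]
print(length(lh_clean))
```

Inspecting the flagged values before filtering makes it easier to justify removing one specific point while retaining the other outliers.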

Descriptive Statistics and Visualisation

PCOS is a common endocrine disorder affecting women, characterized by a range of symptoms that can impact both physical appearance and internal hormone regulation. To better understand this condition, we examined a dataset with several physical and physiological variables.

The bar plot illustrates the distribution of PCOS cases, showing the count of individuals diagnosed with and without PCOS in the dataset.

Descriptive Statistics Cont.

The histogram below visualizes the BMI distribution among individuals with and without PCOS. Key observations are:

Higher Prevalence of Normal BMI in Non-PCOS Individuals: The majority of individuals without PCOS have a BMI between 20 and 30, placing most of them in the normal to overweight categories. Within this range, the count of non-PCOS individuals is markedly higher than the count of those with PCOS.
Greater BMI Variation Among PCOS Individuals: Most BMI values for individuals without PCOS lie below approximately 33, whereas the PCOS group shows a broader distribution, with BMI values reaching close to 40. This suggests a tendency towards higher BMI, with more occurrences of obesity, among individuals diagnosed with PCOS.

Descriptive Statistics Cont.

The below bar chart visualizes the distribution of individuals who reported weight gain, segmented by their PCOS status. Here are the insights based on the graph:

Higher Proportion of Weight Gain in PCOS Individuals: The chart reveals that a larger proportion of individuals diagnosed with PCOS reported experiencing weight gain (represented by the blue bars). This suggests a stronger association between PCOS and weight gain, which is a known symptom of the condition. On the other hand, the group without PCOS (pink bars) shows a much lower count of individuals reporting weight gain, indicating that weight gain is less common in those who do not have PCOS.

Descriptive Statistics Cont.

Analysis of Key Variables in PCOS Dataset

For the following analysis, we have chosen to focus on key variables such as BMI, Age, and LH levels to provide a clearer understanding of the dataset. While the dataset contains a wide range of health indicators, these variables were selected due to their relevance in assessing the characteristics associated with PCOS. The summary statistics reveal notable differences between individuals with and without PCOS:

BMI: Individuals with PCOS have higher median (25.10 vs 23.60) and mean (25.48 vs 23.75) BMI values than those without PCOS, suggesting a greater prevalence of weight-related issues. The upper quartile (Q3) is also noticeably higher in the PCOS group (28.30 vs 26.09), while the maximum values are similar (38.90 vs 38.27), indicating that high BMI values are more common among those diagnosed with PCOS.
Age: Individuals with PCOS tend to be slightly younger, with a lower median age (29 vs 32 years) and a lower mean age (30.1 vs 32.1 years). This aligns with typical age patterns observed in PCOS diagnosis, which often occurs in younger women of reproductive age.
LH Levels: After removal of the extreme outlier, the PCOS group has a higher mean (3.02 vs 2.61 mIU/mL) and a larger standard deviation (2.67 vs 2.10 mIU/mL) of LH levels, which suggests greater hormonal variability, a known characteristic of PCOS. The medians (2.205 vs 2.305) and the maxima (14.24 vs 14.69) are, however, close in the two groups.

PCOS..Y.N.	Min_BMI	Q1_BMI	Median_BMI	Q3_BMI	Max_BMI	Mean_BMI	SD_BMI	Missing_BMI	Min_Age	Q1_Age	Median_Age	Q3_Age	Max_Age	Mean_Age	SD_Age	Missing_Age	Min_LH	Q1_LH	Median_LH	Q3_LH	Max_LH	Mean_LH	SD_LH	Missing_LH
0	13.38797	21.35897	23.60000	26.09093	38.26531	23.74740	3.759378	0	20	28	32	36	48	32.06593	5.360918	0	0.020	1.03	2.305	3.6025	14.69	2.612676	2.103597	0
1	12.41788	23.00473	25.10194	28.30096	38.90000	25.48439	4.404994	0	21	27	29	33	47	30.11364	5.305376	0	0.032	1.00	2.205	4.3000	14.24	3.018250	2.666775	0
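The code behind this summary table is not shown. A table of this shape can be produced in base R by splitting a variable by PCOS status and applying one summary function per group. The sketch below uses hypothetical toy BMI values; `toy`, `summarise_var` and `bmi_by_group` are illustrative names.

```r
# Hypothetical toy data: three BMI values per PCOS group
toy <- data.frame(
  PCOS..Y.N. = factor(c(0, 0, 0, 1, 1, 1)),
  BMI        = c(21.4, 23.6, 26.1, 23.0, 25.1, 28.3)
)

# Summary statistics for one numeric vector, ignoring missing values
summarise_var <- function(x) {
  c(Min     = min(x, na.rm = TRUE),
    Q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
    Median  = median(x, na.rm = TRUE),
    Q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
    Max     = max(x, na.rm = TRUE),
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    Missing = sum(is.na(x)))
}

# One row per PCOS group, one column per statistic
bmi_by_group <- t(sapply(split(toy$BMI, toy$PCOS..Y.N.), summarise_var))
print(round(bmi_by_group, 2))
```

Repeating the last call for Age and LH and binding the results side by side would give the wide layout shown above.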

Hypothesis Testing

To investigate differences between individuals with and without PCOS, we first test Luteinizing Hormone (LH) levels to understand hormonal differences, and then test the association between weight gain and PCOS status.

1. Hypothesis Test for LH Levels

Hypotheses:

Null Hypothesis (\(H_0\)): The average LH levels (\(\mu_{LH}\)) are the same for individuals with and without PCOS.
Alternative Hypothesis (\(H_A\)): The average LH levels (\(\mu_{LH}\)) are different for individuals with and without PCOS.

\[ \begin{aligned} H_0 &: \mu_{\text{PCOS}} = \mu_{\text{Non-PCOS}} \\ H_A &: \mu_{\text{PCOS}} \neq \mu_{\text{Non-PCOS}} \end{aligned} \]

Assumptions

Independence: The observations in each group are assumed to be independent of each other. This means that the data collected from one group should not influence the data collected from the other group.
Normality: It is assumed that the distribution of the dependent variable (LH levels) is approximately normal within each group. Given that the sample sizes are much greater than \(n = 30\), the Central Limit Theorem suggests that the sampling distribution of the means will be approximately normal, even if the original data is not perfectly normal.
Equal Variance: We conducted Levene’s test to assess the homogeneity of variances. This test will help determine if the variances of the two groups are significantly different.

R Code for Levene’s test

library(car)  # provides leveneTest()
levene_result <- leveneTest(LH.mIU.mL. ~ PCOS..Y.N., data = pcos)

# Print the result
print(levene_result)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  5.6915 0.01739 *
##       538                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Levene’s Test for Homogeneity of Variance yielded a p-value of 0.01739, which is less than the significance level of 0.05. This indicates a significant difference in variances between the LH levels of individuals with and without PCOS. Therefore, we reject the null hypothesis of equal variances. This suggests that the assumption of equal variances for a standard two-sample t-test is violated, and Welch’s t-test should be used instead.
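To make the mechanics of the test concrete, the median-centred Levene test can be sketched in base R as a one-way ANOVA on the absolute deviations of each value from its group median. The data below are simulated for illustration only and are not the PCOS dataset; `toy`, `group`, `lh` and `abs_dev` are hypothetical names.

```r
# Simulated data: two groups with different spreads (illustration only)
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("0", "1"), times = c(40, 30))),
  lh    = c(rnorm(40, mean = 2.6, sd = 2.1), rnorm(30, mean = 3.0, sd = 2.7))
)

# Absolute deviation of each value from the median of its own group
toy$abs_dev <- abs(toy$lh - ave(toy$lh, toy$group, FUN = median))

# One-way ANOVA on those deviations is the median-centred Levene test
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
print(levene_by_hand)
```

A small p-value in the `group` row indicates that the typical deviation from the median differs between groups, which is what unequal variances look like in this test.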

R Code for Welch’s t-test

# Perform Welch's t-test to compare LH levels
welch_t_test_lh <- t.test(LH.mIU.mL. ~ PCOS..Y.N., data = pcos, conf.level = 0.95)
print(welch_t_test_lh)

## 
##  Welch Two Sample t-test
## 
## data:  LH.mIU.mL. by PCOS..Y.N.
## t = -1.769, df = 283.76, p-value = 0.07797
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.8568579  0.0457095
## sample estimates:
## mean in group 0 mean in group 1 
##        2.612676        3.018250

Inference from Welch’s t-test

The Welch’s t-test results indicate that there is not enough evidence to conclude that mean LH levels differ between individuals with PCOS (mean ≈ 3.018) and those without PCOS (mean ≈ 2.613). The t-value is approximately -1.769 with 283.76 degrees of freedom, giving a p-value of 0.078, which is greater than the significance level of 0.05.

Furthermore, the 95% confidence interval for the difference in means (group 0 minus group 1) ranges from approximately -0.857 to 0.046. Since this interval includes zero, the difference in means is not statistically significant. Therefore, we fail to reject the null hypothesis. This does not show that the average LH levels are equal in both groups, only that the data do not provide sufficient evidence of a difference.
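The reported statistic can be traced back to the group summaries. The sketch below recomputes the Welch t statistic, the Welch-Satterthwaite degrees of freedom, the p-value and the confidence interval from the means and standard deviations in the summary table. The group sizes (364 and 176) are not stated directly; they are taken here from the column totals of the weight-gain contingency table in this analysis, which is an assumption about how the rows are distributed.

```r
# Group summaries from this analysis (group 0 = no PCOS, group 1 = PCOS)
# n0 and n1 are assumed from the contingency table column totals
m0 <- 2.612676; s0 <- 2.103597; n0 <- 364
m1 <- 3.018250; s1 <- 2.666775; n1 <- 176

# Squared standard error contributed by each group
v0 <- s0^2 / n0
v1 <- s1^2 / n1
se <- sqrt(v0 + v1)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (m0 - m1) / se
df     <- (v0 + v1)^2 / (v0^2 / (n0 - 1) + v1^2 / (n1 - 1))

# Two-sided p-value and 95% confidence interval for mean0 - mean1
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- (m0 - m1) + c(-1, 1) * qt(0.975, df) * se

print(round(c(t = t_stat, df = df, p = p_value), 4))
print(round(ci, 4))
```

Under these assumptions the results should agree with the `t.test()` output above up to rounding. Most of the standard error comes from the smaller, more variable PCOS group, which is why the degrees of freedom fall well below n0 + n1 - 2.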

2. Categorical Association

The Chi-squared test is relevant in this analysis as it helps determine whether there is a significant association between weight gain and PCOS status among individuals in the dataset. By evaluating the independence of these categorical variables, the test shows whether weight gain and PCOS status tend to occur together; it does not by itself establish that PCOS causes weight gain.

Hypotheses:

Null Hypothesis (\(H_0\)): There is no association between weight gain and PCOS status.
Alternative Hypothesis (\(H_A\)): There is an association between weight gain and PCOS status.

\[ \begin{aligned} H_0 &: \text{Weight gain and PCOS are independent} \\ H_A &: \text{Weight gain and PCOS are not independent} \end{aligned} \]

Assumptions for Chi-squared Test

Independence of Observations: The observations must be independent, meaning the outcome of one observation should not affect another. In this analysis, each individual contributes a single weight gain status and a single PCOS status, so one person’s responses do not influence another’s.
Categorical Variables: Both variables being analyzed should be categorical. Here, Weight Gain is categorical (Yes/No) and PCOS Status is categorical (Yes/No), so this assumption is satisfied.
Sample Size: Each expected frequency in the contingency table should ideally be at least 5 for the test results to be reliable. The expected frequencies are computed from the row and column totals of the contingency table. If all expected frequencies are at least 5, this assumption is satisfied; if not, it may be necessary to merge categories or use a different statistical test.
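The sample size assumption can be checked directly. The sketch below enters the observed counts reported for this dataset (the contingency table printed further below) by hand and computes each expected count as row total times column total divided by the grand total; `observed` and `expected` are illustrative names.

```r
# Observed counts from the contingency table in this analysis
# Rows: weight gain (0 = No, 1 = Yes); columns: PCOS (0 = No, 1 = Yes)
observed <- matrix(c(281, 83, 55, 121), nrow = 2,
                   dimnames = list(WeightGain = c("0", "1"),
                                   PCOS = c("0", "1")))

# Expected count = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))

# TRUE when every cell meets the rule of thumb of at least 5
print(all(expected >= 5))
```

The same matrix is also available as `chisq.test(observed)$expected`, so the manual calculation mainly serves to show where the numbers come from.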

R Code for Chi-squared Test:

# Load required packages
library(dplyr)

# Create a contingency table for weight gain and PCOS status
contingency_table <- table(pcos$Weight.gain.Y.N., pcos$PCOS..Y.N.)

# Print the contingency table
print(contingency_table)

##    
##       0   1
##   0 281  55
##   1  83 121

# Perform the Chi-squared test
chi_squared_test <- chisq.test(contingency_table)
print(chi_squared_test)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 104.61, df = 1, p-value < 2.2e-16

Chi-Squared Test Results Summary

The results of the Chi-squared test indicate that the Chi-squared statistic is 104.61 with 1 degree of freedom and a p-value of less than 2.2e-16. Since the p-value is far below the conventional alpha level of 0.05, we reject the null hypothesis, indicating a statistically significant association between weight gain and PCOS status. In this dataset, the proportion of individuals reporting weight gain therefore differs significantly between those with and without PCOS.
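To show how the statistic arises from the table, the sketch below recomputes the Yates-corrected Chi-squared value by hand from the observed counts and also reports the share of each PCOS group that reported weight gain. The variable names are illustrative, and the proportions are a supplementary summary rather than part of the original output.

```r
# Observed counts from the contingency table in this analysis
# Rows: weight gain (0 = No, 1 = Yes); columns: PCOS (0 = No, 1 = Yes)
observed <- matrix(c(281, 83, 55, 121), nrow = 2,
                   dimnames = list(WeightGain = c("0", "1"),
                                   PCOS = c("0", "1")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' correction subtracts 0.5 from each |observed - expected|
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value  <- pchisq(x2_yates, df = 1, lower.tail = FALSE)
print(round(x2_yates, 2))

# Share of each PCOS group that reported weight gain
prop_gain <- observed["1", ] / colSums(observed)
print(round(prop_gain, 3))
```

The proportions make the direction of the association explicit: 121 of the 176 individuals with PCOS reported weight gain, compared with 83 of the 364 individuals without PCOS.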

Discussion

Major Findings:

1. Hormonal Imbalances: Although LH levels varied more widely in the PCOS group, Welch’s t-test did not find a statistically significant difference in mean LH levels between the two groups. This indicates that hormonal irregularities may present differently across individuals.
2. Weight Gain Association: A chi-square test confirmed a significant association between weight gain and PCOS, emphasizing the link between the condition and weight management challenges.

Strengths:

1. Comprehensive Approach: Our use of statistical techniques, from descriptive analysis to hypothesis testing, enabled a thorough examination of the data.
2. Focus on Key Indicators: Concentrating on BMI, LH levels, and weight gain provided clear insights into the characteristics of PCOS.

Limitations:

1. Data Generalizability: The dataset may not represent all populations, limiting the generalizability of the findings.
2. Potential Bias: Clinical data might have selection bias, over-representing those seeking treatment.

Future Directions:

1. Expand Variables: Including more health indicators, such as insulin levels, could provide a deeper understanding of PCOS.
2. Predictive Models: Machine learning models could be explored to predict PCOS risk, aiding in earlier diagnosis and treatment.

Conclusion: Our findings point to a significant association between PCOS and weight gain, and the descriptive statistics show higher BMI in the PCOS group, highlighting the need for targeted weight management in PCOS treatment plans. A better understanding of these key indicators can improve diagnosis and care strategies, leading to better patient outcomes.

References

[1] “Polycystic ovary syndrome (PCOS),” www.kaggle.com. https://www.kaggle.com/datasets/prasoonkottarathil/polycystic-ovary-syndrome-pcos

[2] “The Link Between PCOS and Diabetes,” www.healthcentral.com, May 21, 2020. https://www.healthcentral.com/condition/polycystic-ovary-syndrome-pcos/link-between-pcos-diabetes

[3] Y. Xie, J. J. Allaire, and G. Grolemund, “4.2 Slidy presentation,” in R Markdown: The Definitive Guide. Available: https://bookdown.org/yihui/rmarkdown/slidy-presentation.html

[4] RMIT 2024: Applied Analytics Course Materials

[5] “GraphPad Prism 9 Statistics Guide - The unequal variance Welch t test,” www.graphpad.com. https://www.graphpad.com/guides/prism/latest/statistics/stat_the_unequal_variance_welch_t_t.htm

Statistical Analysis of Polycystic Ovary Syndrome (PCOS) Indicators

Exploring Hormonal and Lifestyle Factors Associated with PCOS
