Prerana Ramchandra: s4058630 and Chandangowda Maruvanahalli Shivaramu: s4063920
Last updated: 18 October, 2024
Data introduction and sourcing
Important Variables:
Age (yrs): Age of the patient (Numeric)
Weight (Kg): Weight in kilograms (Numeric)
Height (Cm): Height in centimeters (Numeric)
BMI: Body Mass Index, calculated using weight and height (Numeric)
PCOS (Y/N): Indicates whether the individual has PCOS (1 for Yes, 0 for No) (Categorical)
FSH(mIU/mL): Follicle Stimulating Hormone levels (Numeric)
LH(mIU/mL): Luteinizing Hormone levels (Numeric)
TSH (mIU/L): Thyroid-Stimulating Hormone levels (Numeric)
Additional Variables: The dataset also includes Random Blood Sugar (RBS), Waist-Hip Ratio, Endometrium thickness, Marriage status (years), Pregnancy status (Y/N), Number of abortions, Blood Group, and various lifestyle indicators such as weight gain, hair growth, skin darkening, pimples, fast food consumption, and regular exercise.
Inspecting the dataset by checking the data types of its columns
data_types <- sapply(pcos, class)
data_type_count <- table(data_types)
# Print the count of each data type
print(data_type_count)
## data_types
## character   integer   numeric
##         3        24        18
pcos$PCOS..Y.N. <- as.factor(pcos$PCOS..Y.N.)
pcos$Cycle.R.I. <- as.factor(pcos$Cycle.R.I.)
pcos$Pregnant.Y.N. <- as.factor(pcos$Pregnant.Y.N.)
pcos$Weight.gain.Y.N. <- as.factor(pcos$Weight.gain.Y.N.)
pcos$hair.growth.Y.N. <- as.factor(pcos$hair.growth.Y.N.)
pcos$Skin.darkening..Y.N. <- as.factor(pcos$Skin.darkening..Y.N.)
pcos$Hair.loss.Y.N. <- as.factor(pcos$Hair.loss.Y.N.)
pcos$Pimples.Y.N. <- as.factor(pcos$Pimples.Y.N.)
pcos$Fast.food..Y.N. <- as.factor(pcos$Fast.food..Y.N.)
pcos$Reg.Exercise.Y.N. <- as.factor(pcos$Reg.Exercise.Y.N.)
In the Y/N columns, “1” signifies “Yes” and “0” signifies “No”.
## [1] "Marraige.Status..Yrs." "Fast.food..Y.N."
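The output above names the two columns that contain missing values. The code that produced it is not shown; a check of this kind can be sketched as follows, using a small hypothetical data frame `toy` in place of `pcos` (all values are invented for illustration).

```r
# Hypothetical stand-in for the pcos data frame; values are invented.
toy <- data.frame(
  Marraige.Status..Yrs. = c(5, NA, 10),
  Fast.food..Y.N.       = c(1, 0, NA),
  BMI                   = c(22.1, 25.3, 27.8)
)

# Count missing values per column and keep the names of columns with any NA.
cols_with_na <- names(toy)[colSums(is.na(toy)) > 0]
print(cols_with_na)
```

Applied to the real data, the same expression with `pcos` in place of `toy` would list every column that needs imputation.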
# Helper function for calculating the mode (no package is needed)
get_mode <- function(v) {
  v <- v[!is.na(v)]  # ignore missing values so NA is never returned as the mode
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Replace missing values in Marraige.Status..Yrs. with mean based on PCOS..Y.N.
pcos$Marraige.Status..Yrs.[is.na(pcos$Marraige.Status..Yrs.) & pcos$PCOS..Y.N. == 1] <-
mean(pcos$Marraige.Status..Yrs.[pcos$PCOS..Y.N. == 1], na.rm = TRUE)
pcos$Marraige.Status..Yrs.[is.na(pcos$Marraige.Status..Yrs.) & pcos$PCOS..Y.N. == 0] <-
mean(pcos$Marraige.Status..Yrs.[pcos$PCOS..Y.N. == 0], na.rm = TRUE)
# Replace missing values in Fast.food..Y.N. with mode based on PCOS..Y.N.
pcos$Fast.food..Y.N.[is.na(pcos$Fast.food..Y.N.) & pcos$PCOS..Y.N. == 1] <-
get_mode(pcos$Fast.food..Y.N.[pcos$PCOS..Y.N. == 1])
pcos$Fast.food..Y.N.[is.na(pcos$Fast.food..Y.N.) & pcos$PCOS..Y.N. == 0] <-
get_mode(pcos$Fast.food..Y.N.[pcos$PCOS..Y.N. == 0])
Identifying outliers - Upon examining the box plots, several outliers were identified but retained for their critical information. An exception was made for the LH hormone markers, where an extreme value of around 2000 mIU/mL was deemed an anomaly and removed from the dataset. The box plot for the LH hormone markers is shown below:
Remove the extreme outlier in LH levels.
## [1] "The outlier in LH (mIU/mL) has been removed"
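The removal code itself is not shown. A minimal sketch of one way to do it follows, assuming a cutoff of 1000 mIU/mL; that threshold, the `toy` data frame, and its values are hypothetical and chosen only so that a value near 2000 mIU/mL is dropped while ordinary readings are kept.

```r
# Hypothetical LH readings (mIU/mL); one extreme value mimics the anomaly.
toy <- data.frame(LH.mIU.mL. = c(2.3, 1.0, 2018.0, 4.3, 14.2))

# Assumed cutoff, not taken from the original analysis.
lh_cutoff <- 1000

# Keep rows below the cutoff; rows with a missing LH value are also kept.
keep <- is.na(toy$LH.mIU.mL.) | toy$LH.mIU.mL. < lh_cutoff
toy_clean <- toy[keep, , drop = FALSE]

print(nrow(toy) - nrow(toy_clean))  # number of rows removed
```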
PCOS is a common endocrine disorder affecting women, characterized by a range of symptoms that can impact both physical appearance and internal hormone regulation. To better understand this condition, we examined a dataset with several physical and physiological variables.
The histogram below visualizes the BMI distribution among individuals with and without PCOS. Key observations are:
The bar chart below visualizes the distribution of individuals who reported weight gain, segmented by their PCOS status. Here are the insights based on the graph:
For the following analysis, we have chosen to focus on key variables such as BMI, Age, and LH levels to provide a clearer understanding of the dataset. While the dataset contains a wide range of health indicators, these variables were selected due to their relevance in assessing the characteristics associated with PCOS. The summary statistics reveal notable differences between individuals with and without PCOS:
BMI: Individuals with PCOS have higher median (25.10 vs 23.60) and mean (25.48 vs 23.75) BMI values than those without PCOS, suggesting a greater prevalence of weight-related issues. The upper quartile (Q3) is also higher in the PCOS group (28.30 vs 26.09), while the maximum values are nearly identical (38.90 vs 38.27), indicating that the distribution as a whole is shifted upward rather than driven by a few extreme cases.
Age: The age distribution between the two groups shows that individuals with PCOS tend to be slightly younger on average, with a lower median age. This aligns with typical age patterns observed in PCOS diagnosis, which often occurs in younger women of reproductive age.
LH Levels: After removal of the extreme outlier, the PCOS group shows a somewhat higher mean (3.02 vs 2.61 mIU/mL) and standard deviation (2.67 vs 2.10 mIU/mL), consistent with the greater hormonal variability that is a known characteristic of PCOS. The medians (2.205 vs 2.305 mIU/mL) and maxima (14.24 vs 14.69 mIU/mL) are similar in the two groups, with the upper quartile higher in the PCOS group (4.30 vs 3.60 mIU/mL). Whether the difference in means is statistically significant is tested formally below.
| PCOS..Y.N. | Min_BMI | Q1_BMI | Median_BMI | Q3_BMI | Max_BMI | Mean_BMI | SD_BMI | Missing_BMI | Min_Age | Q1_Age | Median_Age | Q3_Age | Max_Age | Mean_Age | SD_Age | Missing_Age | Min_LH | Q1_LH | Median_LH | Q3_LH | Max_LH | Mean_LH | SD_LH | Missing_LH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.38797 | 21.35897 | 23.60000 | 26.09093 | 38.26531 | 23.74740 | 3.759378 | 0 | 20 | 28 | 32 | 36 | 48 | 32.06593 | 5.360918 | 0 | 0.020 | 1.03 | 2.305 | 3.6025 | 14.69 | 2.612676 | 2.103597 | 0 |
| 1 | 12.41788 | 23.00473 | 25.10194 | 28.30096 | 38.90000 | 25.48439 | 4.404994 | 0 | 21 | 27 | 29 | 33 | 47 | 30.11364 | 5.305376 | 0 | 0.032 | 1.00 | 2.205 | 4.3000 | 14.24 | 3.018250 | 2.666775 | 0 |
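The code behind the summary table is not shown. As a hedged illustration, grouped statistics of this kind can be computed in base R with `aggregate()`; the data frame `toy` and its values below are hypothetical, and only three of the statistics are reproduced.

```r
# Hypothetical data: three individuals per PCOS group, BMI values invented.
toy <- data.frame(
  PCOS..Y.N. = factor(c(0, 0, 0, 1, 1, 1)),
  BMI        = c(21.4, 23.6, 26.1, 23.0, 25.1, 28.3)
)

# Compute several statistics per group; FUN returns a named vector.
summ <- aggregate(
  BMI ~ PCOS..Y.N., data = toy,
  FUN = function(x) c(Median = median(x), Mean = mean(x), SD = sd(x))
)

# aggregate() stores the statistics as a matrix column; flatten to plain columns.
summ <- do.call(data.frame, summ)
print(summ)
```

The same pattern extends to Age and LH by repeating the call with a different left-hand side in the formula.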
To investigate hormonal differences between individuals with and without PCOS, we test whether mean Luteinizing Hormone (LH) levels differ between the two groups.
\[ H_0: \mu_{\text{PCOS}} = \mu_{\text{Non-PCOS}} \\ H_A: \mu_{\text{PCOS}} \neq \mu_{\text{Non-PCOS}} \]
Independence: The observations in each group are assumed to be independent of each other. This means that the data collected from one group should not influence the data collected from the other group.
Normality: It is assumed that the distribution of the dependent variable (LH levels) is approximately normal within each group. Given that the sample sizes are much greater than \(n = 30\), the Central Limit Theorem suggests that the sampling distribution of the means will be approximately normal, even if the original data is not perfectly normal.
Equal Variance: We conducted Levene’s test to assess the homogeneity of variances, that is, whether the variances of the two groups differ significantly.
library(car)  # provides leveneTest()
levene_result <- leveneTest(LH.mIU.mL. ~ PCOS..Y.N., data = pcos)
# Print the result
print(levene_result)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)
## group   1  5.6915 0.01739 *
##       538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene’s Test for Homogeneity of Variance yielded a p-value of 0.01739, which is less than the significance level of 0.05. This indicates a significant difference in variances between the LH levels of individuals with and without PCOS. Therefore, we reject the null hypothesis of equal variances. This suggests that the assumption of equal variances for a standard two-sample t-test is violated, and Welch’s t-test should be used instead.
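To make the test above less of a black box: with `center = median`, Levene’s test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below shows that computation in base R on simulated data; the data frame `toy`, the group sizes, and the simulation parameters are all hypothetical, so the resulting F value is not comparable to the one reported above.

```r
set.seed(1)

# Simulated LH-like values for two groups (hypothetical sizes and spreads).
toy <- data.frame(
  group = factor(rep(c("0", "1"), times = c(30, 20))),
  lh    = c(rnorm(30, mean = 2.6, sd = 2.1), rnorm(20, mean = 3.0, sd = 2.7))
)

# Absolute deviation of each value from the median of its own group.
toy$abs_dev <- abs(toy$lh - ave(toy$lh, toy$group, FUN = median))

# One-way ANOVA on the absolute deviations gives the Levene F statistic.
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
print(levene_by_hand)
```

A large F value here means the typical spread around the median differs between groups, which is what motivates switching from the pooled-variance t-test to Welch’s version.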
R Code for Welch’s t-test
# Perform Welch's t-test to compare LH levels
# (var.equal = FALSE is the default in t.test() and requests Welch's test)
welch_t_test_lh <- t.test(LH.mIU.mL. ~ PCOS..Y.N., data = pcos,
                          var.equal = FALSE, conf.level = 0.95)
print(welch_t_test_lh)
##
## Welch Two Sample t-test
##
## data: LH.mIU.mL. by PCOS..Y.N.
## t = -1.769, df = 283.76, p-value = 0.07797
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.8568579 0.0457095
## sample estimates:
## mean in group 0 mean in group 1
## 2.612676 3.018250
The Welch’s t-test results indicate that there is not enough evidence to conclude that there is a significant difference in the mean LH levels between individuals with PCOS (mean ≈ 3.018) and those without PCOS (mean ≈ 2.613). The t-value is approximately -1.769 with 283.76 degrees of freedom, resulting in a p-value of 0.078, which is greater than the common significance level of 0.05.
Furthermore, the 95% confidence interval for the difference in means (non-PCOS minus PCOS) ranges from approximately -0.857 to 0.046 mIU/mL. Since this interval includes zero, the difference in means is not statistically significant at the 5% level. Therefore, we fail to reject the null hypothesis. This does not show that average LH levels are equal in the two groups, only that the data do not provide sufficient evidence of a difference.
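As a cross-check, the Welch statistic, its degrees of freedom, and the confidence interval can be recomputed from the group means and standard deviations in the summary table. The sketch below assumes group sizes of 364 (non-PCOS) and 176 (PCOS), taken from the column totals of the weight gain contingency table and consistent with the table reporting no missing LH values; the variable names are illustrative.

```r
# Summary statistics copied from the summary table (LH in mIU/mL).
mean0 <- 2.612676; sd0 <- 2.103597; n0 <- 364  # non-PCOS, assumed n = 281 + 83
mean1 <- 3.018250; sd1 <- 2.666775; n1 <- 176  # PCOS, assumed n = 55 + 121

# Per-group variance of the sample mean and the combined standard error.
v0 <- sd0^2 / n0
v1 <- sd1^2 / n1
se <- sqrt(v0 + v1)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_stat   <- (mean0 - mean1) / se
df_welch <- (v0 + v1)^2 / (v0^2 / (n0 - 1) + v1^2 / (n1 - 1))

# Two-sided p-value and 95% confidence interval for the difference in means.
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci      <- (mean0 - mean1) + c(-1, 1) * qt(0.975, df_welch) * se

print(c(t = t_stat, df = df_welch, p = p_value))
print(ci)
```

Because the unequal group variances enter separately through `v0` and `v1`, the degrees of freedom fall well below the pooled value of 538, which is the practical difference between Welch’s test and the standard two-sample t-test.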
The Chi-squared test is relevant in this analysis because it helps determine whether there is a significant association between weight gain and PCOS status among individuals in the dataset. By evaluating the independence of these categorical variables, the test indicates whether weight gain is reported at different rates in the two groups, although it cannot by itself establish a causal effect of PCOS on weight gain.
\[ H_0: \text{Weight gain and PCOS are independent} \\ H_A: \text{Weight gain and PCOS are not independent} \]
Independence of Observations: The observations must be independent, meaning the outcome of one observation should not affect another. In the analysis, each individual’s weight gain status and PCOS status are treated as separate entities, ensuring that their responses do not influence one another.
Categorical Variables: Both variables being analyzed should be categorical. In this analysis, the variables involved are:
Weight Gain: Categorical (Yes/No)
PCOS Status: Categorical (Yes/No)
Since both variables are categorical, this assumption is satisfied.
Sample Size: Each expected frequency in the contingency table should ideally be at least 5 for the test results to be reliable. The expected frequencies can be computed based on the contingency table. If all expected frequencies are greater than or equal to 5, this assumption is satisfied. If not, it may be necessary to consider merging categories or using a different statistical test.
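The sample size assumption can be checked directly. The sketch below rebuilds the observed counts reported in the contingency table for this test and computes the expected frequencies as (row total × column total) / grand total; the object names are illustrative.

```r
# Observed counts: rows are weight gain (0/1), columns are PCOS status (0/1).
observed <- matrix(
  c(281, 83, 55, 121), nrow = 2,
  dimnames = list(weight_gain = c("0", "1"), pcos = c("0", "1"))
)

# Expected count under independence for every cell.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))

# The assumption holds if every expected count is at least 5.
print(all(expected >= 5))
```

The same matrix is also available as `chisq.test(observed)$expected`, so the check can be run on the fitted test object instead.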
# Load required packages
library(dplyr)
# Create a contingency table for weight gain and PCOS status
contingency_table <- table(pcos$Weight.gain.Y.N., pcos$PCOS..Y.N.)
# Print the contingency table
print(contingency_table)
##
##       0   1
##   0 281  55
##   1  83 121
# Perform the Chi-squared test
chi_squared_test <- chisq.test(contingency_table)
print(chi_squared_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table
## X-squared = 104.61, df = 1, p-value < 2.2e-16
The results of the Chi-squared test indicate that the Chi-squared statistic is 104.61 with 1 degree of freedom and a p-value of less than 2.2e-16. Since the p-value is significantly lower than the conventional alpha level of 0.05, we reject the null hypothesis, indicating a statistically significant association between weight gain and PCOS status. Therefore, it can be concluded that the prevalence of weight gain differs significantly between individuals with and without PCOS in the dataset.
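A significant test result says nothing about the size of the difference. As a supplementary sketch (not part of the original analysis), the proportion reporting weight gain within each PCOS group and the sample odds ratio can be computed from the same observed counts; the object names are illustrative.

```r
# Observed counts: rows are weight gain (0/1), columns are PCOS status (0/1).
observed <- matrix(
  c(281, 83, 55, 121), nrow = 2,
  dimnames = list(weight_gain = c("0", "1"), pcos = c("0", "1"))
)

# Column proportions: share of each PCOS group with and without weight gain.
props <- prop.table(observed, margin = 2)
print(round(props, 3))

# Sample odds ratio of weight gain for PCOS versus non-PCOS individuals.
odds_ratio <- (observed["1", "1"] * observed["0", "0"]) /
              (observed["0", "1"] * observed["1", "0"])
print(odds_ratio)
```

In this dataset the row for weight gain = 1 shows roughly 69% in the PCOS group against roughly 23% in the non-PCOS group.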
1. Hormonal Imbalances: Although LH levels varied more widely in the PCOS group, Welch’s t-test did not find a statistically significant difference in mean LH levels between the two groups. This indicates that hormonal irregularities may present differently across individuals.
2. Weight Gain Association: A chi-square test confirmed a significant association between weight gain and PCOS, emphasizing the link between the condition and weight management challenges.
Strengths:
1. Comprehensive Approach: Our use of statistical techniques, from descriptive analysis to hypothesis testing, enabled a thorough examination of the data.
2. Focus on Key Indicators: Concentrating on BMI, LH levels, and weight gain provided clear insights into the characteristics of PCOS.
Limitations:
1. Data Generalizability: The dataset may not represent all populations, limiting the generalizability of the findings.
2. Potential Bias: Clinical data might have selection bias, over-representing those seeking treatment.
Future Directions:
1. Expand Variables: Including more health indicators, such as insulin levels, could provide a deeper understanding of PCOS.
2. Predictive Models: Machine learning models could be explored to predict PCOS risk, aiding in earlier diagnosis and treatment.
Conclusion: Our findings show a statistically significant association between PCOS and weight gain, and the descriptive statistics show higher BMI in the PCOS group, highlighting the need for targeted weight management in PCOS treatment plans. A better understanding of these key indicators can improve diagnosis and care strategies, leading to better patient outcomes.
[1] “Polycystic ovary syndrome (PCOS),” www.kaggle.com. https://www.kaggle.com/datasets/prasoonkottarathil/polycystic-ovary-syndrome-pcos
[2] “The Link Between PCOS and Diabetes,” www.healthcentral.com, May 21, 2020. https://www.healthcentral.com/condition/polycystic-ovary-syndrome-pcos/link-between-pcos-diabetes
[3] Y. Xie, J. J. Allaire, and G. Grolemund, “4.2 Slidy presentation,” in R Markdown: The Definitive Guide. Available: https://bookdown.org/yihui/rmarkdown/slidy-presentation.html
[4] RMIT 2024: Applied Analytics Course Materials
[5] “GraphPad Prism 9 Statistics Guide - The unequal variance Welch t test,” www.graphpad.com. https://www.graphpad.com/guides/prism/latest/statistics/stat_the_unequal_variance_welch_t_t.htm