Presentation Midterm

Purpose

To understand which factors affect women’s fertility and menstrual cycles.
Uses data collected in a Marquette University study: 1665 survey responses covering factors including:
Client’s age, height, weight, BMI, bleeding intensity, medications taken, etc.

Medication Data Cleaning 1

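#Medication/pie chart
library(dplyr)    # %>%, select(), mutate(), case_when()
library(tidyr)    # separate_rows()
library(stringr)  # str_trim(), str_detect(), regex()

fertile_data <- read.csv("FedCycleData071012 (1).csv", sep = ",", header = TRUE)

# Keep only rows where Medvitexplain is neither NA nor blank
fertile_Med <- fertile_data[!is.na(fertile_data$Medvitexplain) & trimws(fertile_data$Medvitexplain) != "", ]

# Move the medication columns to the front for easier inspection
fertile_Med <- fertile_Med %>%
  select(ClientID, Medvitexplain, Medvits, MedvitsM, everything())

# Separate each medicine into its own row, independent of client.
# "\\band\\b" matches "and" only as a whole word, so names that merely contain "and" are not split.
fertile_Med2 <- fertile_Med %>%
  separate_rows(Medvitexplain, sep = ",|;|\\band\\b") %>%
  mutate(Medvitexplain = str_trim(Medvitexplain))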
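To illustrate why the split pattern matters, here is a minimal base-R sketch on a made-up entry. The helper name split_meds and the example text are hypothetical and not taken from the dataset. A bare "and" in the pattern also splits inside any word that contains those letters, while "\\band\\b" only matches the whole word.

```r
# Hypothetical helper: split one free-text entry into individual medicines
split_meds <- function(x, pattern) {
  parts <- trimws(strsplit(x, pattern, perl = TRUE)[[1]])
  parts[parts != ""]  # drop empty pieces left by adjacent separators
}

entry <- "multivitamin and iron; Brand X"  # made-up example entry

naive      <- split_meds(entry, ",|and|;")        # bare "and"
whole_word <- split_meds(entry, ",|;|\\band\\b")  # "and" as a whole word only

print(naive)       # "Brand X" is broken into "Br" and "X"
print(whole_word)  # "Brand X" stays in one piece
```

The same pattern string can be passed to separate_rows(), since its sep argument is treated as a regular expression.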

Medication Data Cleaning 2

Categorizing Medicine:

# Assign each medicine to a category; case_when() uses the first pattern that matches, so order matters.
# Start from fertile_Med2 so the one-medicine-per-row split is kept.
fertile_Med2 <- fertile_Med2 %>%
  mutate(category = case_when(
    str_detect(Medvitexplain, regex("vit|iron|oil|B12|B6|optivite|calcium|one-a-day|folic Acid|supplement", ignore_case = TRUE)) ~ "Supplements",
    str_detect(Medvitexplain, regex("pain|head|migrane|migraine|aspirin|neurontin|flexeril|adderall|adderal", ignore_case = TRUE)) ~ "Pain Medicine",
    str_detect(Medvitexplain, regex("lexapro|equate|effexor|seroquel|paxil|citalopram|wellbutrin|welbutrin|zoloft|fluoxetine|cymbalta|Cybalta|provigil|celexa", ignore_case = TRUE)) ~ "Anti-Depressants",
    str_detect(Medvitexplain, regex("thyroid|levoxyl|thyro|synthroid|syntthroid|levothroid|levothyroxine", ignore_case = TRUE)) ~ "Thyroid-related medication",
    str_detect(Medvitexplain, regex("insulin|biotic|reglan|antioxidant|greens|hydrochlorothiazide| alpha|minocyclen|humalog|cream|medroxyprogest|detrol|prometrium", ignore_case = TRUE)) ~ "Metabolic/Hormonal",
    TRUE ~ "Other"
  )) %>%
  select(Medvitexplain, category, everything())

# Rows that fell into "Other", kept for manual review
fertile_check <- fertile_Med2 %>%
  filter(category == "Other")

# Drop any blank entries left over from splitting
fertile_check <- fertile_check[!is.na(fertile_check$Medvitexplain) & trimws(fertile_check$Medvitexplain) != "", ]

Medication Data Cleaning 3

# Counting the number of distinct women taking each category of medicine

fertile_countMed <- fertile_Med2 %>%
  group_by(category) %>%
  summarise(n = n_distinct(ClientID))

Medication proportions pie
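A minimal sketch of how a pie chart like this could be drawn from a category count table, using base R's pie(). The counts below are made up for illustration, and fertile_countMed_demo is a hypothetical stand-in for fertile_countMed; the original chart may have been built with a different plotting package.

```r
# Made-up counts per category (not the study's values)
fertile_countMed_demo <- data.frame(
  category = c("Supplements", "Pain Medicine", "Other"),
  n        = c(20, 12, 8)
)

# Share of each category as a percentage of all category counts
fertile_countMed_demo$pct <- round(
  100 * fertile_countMed_demo$n / sum(fertile_countMed_demo$n), 1
)

# Label each slice with its category and percentage
slice_labels <- paste0(fertile_countMed_demo$category, " (", fertile_countMed_demo$pct, "%)")

pie(fertile_countMed_demo$n,
    labels = slice_labels,
    main   = "Medication proportions (demo data)")
```

Because a woman who takes medicines from several categories is counted once in each, the slices are shares of category memberships rather than shares of women.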

Observations

Method:

Created a new column called “category” to categorize the different medicines women take. I manually looked up the ones I didn’t know and created a key using the case_when() function.

The “other” category includes special-circumstance medicines such as blood pressure medicine, various allergy medicines, pulmonary medicine, etc. There were too few of each to make into their own categories.

As expected, a large proportion of women (72.5 percent) take supplements for their period or fertility. More interesting is that women also take medications for a range of other conditions: anti-depressants, pain medicine (also expected), and thyroid-related medicine. Thyroid-related medicine made up a larger proportion than the pulmonary or blood pressure medicine in the “other” category, so I kept it as a separate category.

Data Cleaning: BMI, Mean Cycle Length and Bleeding Intensity

fertile_BMI<- fertile_data[!is.na(fertile_data$BMI) & trimws(fertile_data$BMI) != "",]

fertile_BMI <- fertile_BMI%>%
  select(ClientID, BMI, everything())

3D Plot: BMI, Mean Cycle Length and Mean Bleeding Intensity

Statistical Analysis: 3D Plot and Observations

## 
## Call:
## lm(formula = MeanBleedingIntensity ~ BMI + MeanCycleLength, data = fertile_BMI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8552 -2.2497  0.0444  1.5190  8.2119 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      4.90486    2.56907   1.909   0.0592 .
## BMI             -0.02247    0.04271  -0.526   0.6000  
## MeanCycleLength  0.19046    0.07990   2.384   0.0191 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.572 on 98 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.05674,    Adjusted R-squared:  0.03749 
## F-statistic: 2.947 on 2 and 98 DF,  p-value: 0.05715

Observations for 3D plot

At first glance, there appears to be a three-way relationship between BMI, cycle length, and bleeding intensity. Many of the data points cluster in one corner, where lower BMI and shorter cycle length go with lower mean bleeding intensity.

However, the linear regression does not support this. The overall model is not statistically significant at the 0.05 level (F-statistic p-value of 0.057), and r squared is very low (0.057), so the model explains under 6 percent of the variation in bleeding intensity. Mean cycle length is the stronger predictor, with a p-value of 0.019: holding BMI constant, each 1-unit increase in cycle length is associated with an increase of about 0.19 in mean bleeding intensity. BMI has no statistically significant effect on bleeding intensity (p = 0.60).
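As a quick consistency check, the F statistic and the t value in the regression output can be recomputed from the other printed numbers. This is only a sketch using the rounded values shown in the summary, so the results match to about three decimal places; the variable names are my own.

```r
# Values copied from the printed lm() summary
r2       <- 0.05674  # Multiple R-squared
k        <- 2        # number of predictors (BMI, MeanCycleLength)
df_resid <- 98       # residual degrees of freedom

# F statistic from R-squared: (R2 / k) / ((1 - R2) / df_resid)
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail p-value of the F distribution
p_val <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

# t value for MeanCycleLength: estimate divided by its standard error
t_cycle <- 0.19046 / 0.07990

print(c(F = f_stat, p = p_val, t = t_cycle))
```

The recomputed values land close to the reported F-statistic of 2.947, p-value of 0.05715, and t value of 2.384.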

Checking for a 2-way linear relationship: Mean Cycle Length and Mean Bleeding Intensity:

## 
## Call:
## lm(formula = MeanBleedingIntensity ~ MeanCycleLength, data = fertile_BMI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8446 -2.2493  0.1497  1.5759  8.0492 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      4.36603    2.34747   1.860   0.0659 .
## MeanCycleLength  0.18931    0.07958   2.379   0.0193 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.563 on 99 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.05407,    Adjusted R-squared:  0.04452 
## F-statistic: 5.659 on 1 and 99 DF,  p-value: 0.01928

2D graph and Best fit line
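A minimal sketch of how a scatter plot with a best-fit line can be produced in base R. The data here are simulated purely for illustration (the data frame demo and its values are assumptions, not the study data), and the original graph may have used a different plotting package.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for the two columns used in the regression
demo <- data.frame(MeanCycleLength = runif(50, min = 24, max = 36))
demo$MeanBleedingIntensity <- 4 + 0.2 * demo$MeanCycleLength + rnorm(50, sd = 2.5)

# Simple linear regression of bleeding intensity on cycle length
fit <- lm(MeanBleedingIntensity ~ MeanCycleLength, data = demo)

# Scatter plot with the fitted regression line drawn on top
plot(demo$MeanCycleLength, demo$MeanBleedingIntensity,
     xlab = "Mean cycle length",
     ylab = "Mean bleeding intensity",
     main = "Best-fit line (demo data)")
abline(fit, col = "red")
```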

Observations for 2D graph

This confirms a statistically significant but weak relationship between mean cycle length and mean bleeding intensity (p = 0.019). r squared is still low (0.054), so cycle length explains only about 5 percent of the variation in bleeding intensity, and we cannot claim a strong relationship.

Data Cleaning: Relationship between Mean Bleeding Intensity and Abortion

Cleaned all the NA values for the corresponding variables

# Reading the chosen CSV file downloaded from Kaggle
fertile_data <- read.csv("FedCycleData071012 (1).csv", sep = ",", header = TRUE)

#Cleaning the blank or NA cells 
fertile_clean_jes <- fertile_data[!is.na(fertile_data$Abortions) & !is.na(fertile_data$MeanBleedingIntensity), ]

#selecting the columns 
fertile_clean_jes <- fertile_clean_jes %>%
  select(Abortions, MeanBleedingIntensity, everything())


# Grouping data: one row per client, averaging each variable across that client's rows
fertile_group_jes <- fertile_clean_jes %>%
  group_by(ClientID) %>%
  summarise(Abortions = mean(Abortions, na.rm = TRUE),
            MeanBleedingIntensity = mean(MeanBleedingIntensity, na.rm = TRUE)) %>%
  ungroup()

Box plot for distribution of mean bleeding intensity based on abortion

Statistical Analysis

ANOVA Test

An ANOVA test was applied to determine whether mean bleeding intensity differs significantly between the abortion-count groups.

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Abortions    2   40.6  20.289   3.582 0.0315 *
## Residuals   99  560.8   5.665                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
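A minimal sketch of how a one-way ANOVA table like this can be produced and read. It assumes Abortions is treated as a factor (the 2 degrees of freedom in the output suggest three groups); the data are simulated and the names demo and fit are my own.

```r
set.seed(2)  # make the simulated data reproducible

# Simulated stand-in: three abortion-count groups of unequal size
demo <- data.frame(Abortions = rep(c(0, 1, 2), times = c(30, 8, 4)))
demo$MeanBleedingIntensity <- rnorm(nrow(demo), mean = 10 - demo$Abortions, sd = 2.4)

# One-way ANOVA with the abortion count as a grouping factor
fit <- aov(MeanBleedingIntensity ~ factor(Abortions), data = demo)
tab <- summary(fit)[[1]]
print(tab)

# The F value is the between-group mean square over the residual mean square
f_manual <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]

# Same arithmetic on the rounded values printed in the table above
f_reported <- 20.289 / 5.665
print(f_reported)
```

Dividing the two mean squares from the reported table gives a value close to the printed F value of 3.582.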

Data Interpretation

The box plot and ANOVA test show a statistically significant difference in mean bleeding intensity across the abortion-count groups (p = 0.0315). Although far more data are available for some abortion counts than for others, mean bleeding intensity trends lower for women who have had an abortion than for those who have not. In the box plot for subjects with 0 abortions, mean bleeding intensity varies widely between subjects without a specific pattern. The lower mean bleeding intensity among subjects who have had an abortion therefore points to an interesting correlation. This does not show that abortion causes a change in bleeding intensity; it is a relationship that can be investigated further.

Data Cleaning

Comparing the Correlation between Miscarriage and Weight vs. Miscarriage and Income level

Cleaned all the NA values for the corresponding variables

#Cleaning the data 
fertile_clean_jes2 <- fertile_data[!is.na(fertile_data$Miscarriages) & !is.na(fertile_data$IncomeM) & !is.na(fertile_data$Weight), ] %>%
  group_by(ClientID) %>%
  summarise(Miscarriages = first(Miscarriages),
             IncomeM = first(IncomeM),
             Weight = mean(Weight, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(Weight >= 80 & Weight <= 400)

Histogram for Miscarriages and Weight

Histogram for the distribution of Miscarriage Occurrence based on Weight

Bar Graph for Miscarriages and Income Level

- Bar Graph for distribution of Miscarriage Occurrence based on Income level

Statistical Analysis

Chi-Square

A chi-square test was used to check for an association between the two categorical (level) variables in each pair.

## 
##  Pearson's Chi-squared test
## 
## data:  weight_chi
## X-squared = 15.014, df = 18, p-value = 0.661

## 
##  Pearson's Chi-squared test
## 
## data:  income_chi
## X-squared = 8.3396, df = 12, p-value = 0.7581
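A minimal sketch of how a contingency table such as weight_chi could be built and tested. How the original tables were constructed is not shown, so the weight bands, the simulated data, and the names demo and weight_chi_demo are all assumptions for illustration.

```r
set.seed(3)  # make the simulated data reproducible

# Simulated stand-in for per-client weight and miscarriage count
demo <- data.frame(
  Weight       = runif(120, min = 100, max = 250),
  Miscarriages = sample(0:2, 120, replace = TRUE)
)

# Bin the continuous weight into bands so both variables are categorical
demo$WeightBand <- cut(demo$Weight, breaks = c(80, 130, 180, 230, 400))

# Contingency table: weight band (rows) by miscarriage count (columns)
weight_chi_demo <- table(demo$WeightBand, demo$Miscarriages)

# Pearson's chi-squared test of independence
res <- chisq.test(weight_chi_demo)
print(res)

# Degrees of freedom are (rows - 1) * (columns - 1)
df_expected <- (nrow(weight_chi_demo) - 1) * (ncol(weight_chi_demo) - 1)
```

chisq.test() may warn that the approximation is unreliable when some cells have small expected counts, which is likely with many levels and few subjects per cell.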

Data Interpretation

The histogram and bar graph, along with the chi-square tests, do not allow us to reject the null hypothesis that miscarriage is unrelated to weight or to income. While this is not a definitive conclusion, both p-values (0.661 for weight and 0.758 for income) are well above 0.05, so any apparent association between these variables and miscarriage is not statistically significant and could be due to chance. Other factors could also have affected the chance of miscarriage. For example, higher weight might result from steroid-based medication taken for an earlier illness rather than reflect a health problem arising from weight itself, and income at the time the data were collected could differ from income at the time of the subject’s pregnancies.

Final Conclusions

The analysis found no statistically significant correlation between weight or BMI and the fertility measures examined, specifically miscarriage and mean bleeding intensity. In contrast, mean cycle length and abortion each showed a statistically significant relationship with mean bleeding intensity. Given these trends, future studies could examine how mean cycle length, and the hormones or stress levels that affect cycle length, influence an individual’s bleeding intensity. They could also examine whether the hormones secreted are correlated with the formation of cysts (such as ovarian cysts) and how that relates to bleeding intensity.

Furthermore, given the trend observed between abortion and mean bleeding intensity, future studies could look more closely at the effect of invasive procedures such as abortion on fertility measures such as mean bleeding intensity (specifically the benefits or consequences of high or low mean bleeding intensity, as well as menopause). Although none of these relationships can be interpreted as causation, because a statistical relationship alone does not establish cause, studying the correlations would improve understanding of how different factors affect women’s fertility.