Lab 5: Understanding the factors influencing fertility among women facing conception challenges

# 1. load the appropriate libraries
library(tidyverse)
library(broom)
library(table1)
library(car)
library(gtsummary)
library(modelsummary)
library(ggplot2)
library(GGally)
library(reshape2)
library(rlang)
library(dplyr)

# 2. use read_csv() to read the dataset
fertility <- read_csv("Data/Fertility.csv")

Section 1: Introduction

summary(fertility)

##     rownames        Age            LowAFC         MeanAFC           FSH        
##  Min.   :  1   Min.   :21.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.500  
##  1st Qu.: 84   1st Qu.:32.00   1st Qu.: 7.00   1st Qu.: 8.00   1st Qu.: 4.600  
##  Median :167   Median :35.00   Median :11.00   Median :12.00   Median : 5.700  
##  Mean   :167   Mean   :35.33   Mean   :12.29   Mean   :13.53   Mean   : 5.935  
##  3rd Qu.:250   3rd Qu.:39.00   3rd Qu.:15.00   3rd Qu.:17.00   3rd Qu.: 6.900  
##  Max.   :333   Max.   :46.00   Max.   :41.00   Max.   :51.50   Max.   :16.000  
##        E2            MaxE2        MaxDailyGn       TotalGn        Oocytes     
##  Min.   :13.00   Min.   : 290   Min.   :100.0   Min.   : 825   Min.   : 1.00  
##  1st Qu.:30.00   1st Qu.: 994   1st Qu.:225.0   1st Qu.:1675   1st Qu.: 7.00  
##  Median :39.00   Median :1443   Median :300.0   Median :2550   Median :11.00  
##  Mean   :41.25   Mean   :1546   Mean   :310.8   Mean   :2831   Mean   :11.84  
##  3rd Qu.:52.00   3rd Qu.:1856   3rd Qu.:450.0   3rd Qu.:3962   3rd Qu.:15.00  
##  Max.   :90.00   Max.   :6242   Max.   :525.0   Max.   :7275   Max.   :35.00  
##     Embryos      
##  Min.   : 0.000  
##  1st Qu.: 4.000  
##  Median : 6.000  
##  Mean   : 6.727  
##  3rd Qu.: 9.000  
##  Max.   :23.000

glimpse(fertility)

## Rows: 333
## Columns: 11
## $ rownames   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Age        <dbl> 40, 37, 40, 40, 30, 29, 31, 33, 36, 35, 25, 39, 35, 30, 37,…
## $ LowAFC     <dbl> 40, 41, 38, 36, 36, 35, 24, 28, 30, 32, 27, 32, 31, 18, 29,…
## $ MeanAFC    <dbl> 51.5, 41.0, 41.0, 37.5, 36.0, 35.0, 35.0, 34.0, 33.0, 32.0,…
## $ FSH        <dbl> 5.3, 7.1, 4.9, 3.9, 4.0, 3.9, 3.8, 4.3, 4.9, 3.7, 5.0, 5.3,…
## $ E2         <dbl> 45, 53, 40, 26, 49, 67, 49, 20, 60, 36, 20, 37, 30, 33, 40,…
## $ MaxE2      <dbl> 1427, 802, 4533, 1804, 2526, 3812, 1087, 1615, 1879, 2009, …
## $ MaxDailyGn <dbl> 300.0, 225.0, 450.0, 300.0, 150.0, 150.0, 262.5, 375.0, 300…
## $ TotalGn    <dbl> 2700.0, 1800.0, 4850.0, 2700.0, 1500.0, 975.0, 2512.5, 3075…
## $ Oocytes    <dbl> 25, 7, 27, 9, 19, 19, 13, 15, 23, 26, 22, 22, 7, 27, 12, 11…
## $ Embryos    <dbl> 13, 6, 15, 4, 12, 16, 9, 9, 10, 8, 13, 18, 5, 18, 9, 2, 8, …

Answer:

Background:

The dataset consists of fertility measurements collected from a large sample of women experiencing difficulty conceiving. The full data set can be found here. These measurements were obtained by a medical team led by Dr. Priya Maseelall, who investigated various factors related to infertility among these women.

Research Questions:

• How do age, antral follicle count (LowAFC and MeanAFC), follicle stimulating hormone level (FSH), fertility levels (E2 and MaxE2), and gonadotropin levels (MaxDailyGn and TotalGn) relate to each other among women experiencing difficulty in getting pregnant?

• Are there any significant correlations or associations between these variables and the number of egg cells (Oocytes) or embryos (Embryos) produced?

• Does age have a significant impact on a woman’s fertility level, with older women exhibiting lower fertility levels compared to younger women?

• Is there a negative correlation between a woman’s age and the number of egg cells she produces, suggesting that older women tend to have fewer egg cells compared to younger women?

• What is the association between the average antral follicle count and the number of egg cells produced by a woman, indicating whether higher antral follicle counts are related to a greater number of egg cells?

Motivation:

Understanding the factors influencing fertility among women facing conception challenges is crucial for improving reproductive health outcomes. By further exploring the relationships between age, hormonal levels, antral follicle counts, and other variables, clinicians and researchers can better tailor fertility treatments and interventions. Previous studies have already highlighted the importance of antral follicle counts and hormonal levels in predicting ovarian reserve and fertility outcomes among women undergoing assisted reproductive technologies like IVF.

Hypotheses:

• Older age is expected to correlate negatively with antral follicle counts, ovarian reserve, and the number of egg cells and embryos produced.

• Higher levels of follicle stimulating hormone (FSH) may be associated with decreased ovarian reserve and fertility levels.

• Antral follicle counts (LowAFC and MeanAFC) are hypothesized to positively correlate with the number of egg cells and embryos.

• Elevated levels of gonadotropins (MaxDailyGn and TotalGn) may indicate increased ovarian stimulation and potentially higher production of egg cells and embryos.

Data Description:

• Number of observations: 333

• Number of variables: 10 (the file has 11 columns; rownames is a row index, not a measurement)

• Variables:

Age: Age of the women in years.
LowAFC: Smallest antral follicle count.
MeanAFC: Average antral follicle count.
FSH: Maximum follicle stimulating hormone level.
E2: Fertility level.
MaxE2: Maximum fertility level.
MaxDailyGn: Maximum daily gonadotropin level.
TotalGn: Total gonadotropin level.
Oocytes: Number of egg cells.
Embryos: Number of embryos.

These variables provide insights into various aspects of fertility, including age, ovarian reserve (antral follicle counts), hormonal levels, and outcomes such as the number of egg cells and embryos produced. Further analysis of this dataset can help us understand the complex connections between these factors in women experiencing infertility.

Section 2: Exploratory Data Analysis

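library(table1)

# Divide E2 into 5 equal-width bins. Note: cut() with breaks = 5 splits the range of E2
# into equal-width intervals, not quantiles; the variable name is kept for consistency.
fertility$E2_quantiles <- cut(fertility$E2, breaks = 5, labels = FALSE, include.lowest = TRUE)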

# Create the summary table
table1(~ Age + MeanAFC + FSH + TotalGn + Oocytes + Embryos | factor(E2_quantiles), data = fertility, overall = FALSE)

	1 (N=69)	2 (N=125)	3 (N=93)	4 (N=36)	5 (N=10)
Age
Mean (SD)	36.6 (4.32)	34.7 (5.02)	35.1 (4.27)	36.1 (5.21)	34.5 (3.57)
Median [Min, Max]	37.0 [25.0, 44.0]	35.0 [21.0, 45.0]	35.0 [24.0, 44.0]	35.0 [27.0, 46.0]	35.0 [28.0, 40.0]
MeanAFC
Mean (SD)	15.2 (7.18)	13.5 (7.17)	13.2 (8.29)	12.3 (6.81)	10.4 (3.79)
Median [Min, Max]	15.0 [2.00, 37.5]	12.0 [0, 41.0]	12.0 [0, 51.5]	11.3 [2.00, 35.0]	9.25 [7.00, 18.0]
FSH
Mean (SD)	6.39 (2.34)	5.84 (1.83)	5.73 (1.82)	5.99 (1.89)	5.66 (1.34)
Median [Min, Max]	6.30 [2.30, 16.0]	5.70 [1.90, 11.8]	5.60 [0.500, 11.4]	5.55 [3.30, 10.5]	5.95 [3.20, 7.00]
TotalGn
Mean (SD)	3110 (1570)	2650 (1280)	2740 (1340)	3180 (1390)	2840 (940)
Median [Min, Max]	2850 [825, 7280]	2180 [1010, 5850]	2360 [925, 5850]	3490 [975, 5400]	2780 [1580, 4500]
Oocytes
Mean (SD)	12.8 (5.64)	12.1 (6.15)	11.4 (5.94)	10.0 (5.41)	11.8 (5.57)
Median [Min, Max]	12.0 [2.00, 30.0]	12.0 [1.00, 28.0]	10.0 [2.00, 35.0]	9.50 [3.00, 29.0]	10.5 [4.00, 23.0]
Embryos
Mean (SD)	7.58 (3.86)	6.61 (3.90)	6.47 (4.49)	5.97 (4.07)	7.40 (3.53)
Median [Min, Max]	7.00 [2.00, 21.0]	6.00 [0, 20.0]	6.00 [0, 23.0]	5.50 [0, 16.0]	6.00 [3.00, 13.0]
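
The unequal group sizes in the table above (N = 69, 125, 93, 36, 10) are what equal-width bins produce on a skewed variable. As a minimal sketch of the difference between equal-width bins and true quintiles, the example below uses simulated values as a stand-in for E2; the vector e2_sim and its gamma distribution are assumptions for illustration only, not the lab data.

```r
# Simulated right-skewed values standing in for E2 (hypothetical, not the lab data)
set.seed(1)
e2_sim <- rgamma(333, shape = 8, rate = 0.2)

# Equal-width bins: a single number for `breaks` splits the range into 5 equal intervals
equal_width <- cut(e2_sim, breaks = 5, labels = FALSE, include.lowest = TRUE)

# Quantile bins: break points at the 0%, 20%, ..., 100% quantiles
quintile_breaks <- quantile(e2_sim, probs = seq(0, 1, by = 0.2))
quintiles <- cut(e2_sim, breaks = quintile_breaks, labels = FALSE, include.lowest = TRUE)

table(equal_width)  # group sizes can differ widely
table(quintiles)    # group sizes are (nearly) equal
```

With real data that contain many tied values, the quantile break points may coincide, in which case cut() stops with an error about non-unique breaks; a rank-based grouping such as dplyr::ntile() avoids that.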

# Create a histogram of the E2 bins (one bar per equal-width bin)
ggplot(fertility, aes(x = E2_quantiles)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Fertility Level (E2) by Bin", x = "E2 bin (1 = lowest, 5 = highest)", y = "Frequency")

# Create a boxplot for E2 (a horizontal boxplot has no frequency axis)
ggplot(fertility, aes(x = E2)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Fertility Level (E2)", x = "Fertility Level (E2)", y = NULL)

# Boxplots of E2 within each bin; the bin index is converted to a factor so that
# ggplot draws one box (and one fill colour) per bin
ggplot(fertility, aes(x = factor(E2_quantiles), y = E2, fill = factor(E2_quantiles))) +
  geom_boxplot() +
  labs(title = "Boxplots of Fertility (E2) by Bin",
       x = "Bin of E2",
       y = "Fertility (E2)",
       fill = "Bin") +
  theme_minimal() +
  facet_wrap(~E2_quantiles, scales = "free")

Answer:

The histogram shows that the most frequent fertility levels fall in the second bin (N = 125), followed by the third bin (N = 93). Because the bins are equal-width intervals of E2 rather than quantiles, the group sizes differ.

Section 2: Exploratory Data Analysis, Histograms

## [1] 333  11

Answer:

Histogram of Age: The majority of women are between the ages of 30 and 40.
Histograms of LowAFC and MeanAFC: The distributions are skewed to the right, with higher counts at lower AFC values.
Histogram of MaxE2: The distribution is skewed to the right, with higher counts at smaller MaxE2 values, hence lower maximum fertility levels.
Overall, LowAFC, MeanAFC, MaxE2 and Oocytes show similar right-skewed distributions.
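
Histograms like the ones described above could be drawn in one faceted plot. The sketch below is one possible way to do it, assuming ggplot2 and tidyr are available; the data frame sim and its distributions are simulated placeholders, and in the lab the fertility data frame would take its place.

```r
library(ggplot2)
library(tidyr)

# Hypothetical stand-in data; replace `sim` with the lab data frame
set.seed(2)
sim <- data.frame(
  Age     = round(rnorm(333, mean = 35, sd = 4.5)),
  MeanAFC = rgamma(333, shape = 3.5, rate = 0.26),
  MaxE2   = rgamma(333, shape = 4, rate = 0.0026)
)

# Reshape to long format so that each variable gets its own panel
sim_long <- pivot_longer(sim, cols = everything(),
                         names_to = "variable", values_to = "value")

p <- ggplot(sim_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~ variable, scales = "free") +
  labs(x = NULL, y = "Frequency")
p
```

Free scales are used because the variables are measured in very different units.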

Section 2: Exploratory Data Analysis, Baseline Tables

# Create a new variable indicating quartiles of embryo numbers
final <- final %>%
  mutate(embryoquartile = ntile(Embryos, 4))

final %>% 
  select(c(Age, LowAFC, MeanAFC, FSH, E2, MaxE2, MaxDailyGn, TotalGn, Oocytes, Embryos, embryoquartile)) %>% 
  tbl_summary(by = embryoquartile,
              missing = "no", 
              statistic = list(
                all_continuous() ~ "{mean} ({sd})",
                all_categorical() ~ "{n} ({p}%)"
              ), 
              digits = all_continuous() ~ 2, 
            
              ) %>%
  add_n() %>%
  add_p(test = list(all_continuous() ~ "aov", all_categorical() ~ "chisq.test"))  %>% bold_p() %>%
  modify_caption("Table 1: Descriptive Statistics of Variables by Embryos (Quartile)")

Table 1: Descriptive Statistics of Variables by Embryos (Quartile)
Characteristic	N	1, N = 84¹	2, N = 83¹	3, N = 83¹	4, N = 83¹	p-value²
Age	333	36.70 (4.77)	34.93 (5.12)	34.94 (4.47)	34.75 (4.19)	0.021
LowAFC	333	10.18 (5.51)	11.93 (7.16)	12.00 (5.91)	15.07 (8.02)	<0.001
MeanAFC	333	11.35 (6.11)	12.89 (6.99)	13.51 (6.67)	16.40 (8.86)	<0.001
FSH	333	6.34 (2.09)	6.27 (2.07)	5.85 (1.94)	5.29 (1.45)	0.001
E2	333	42.04 (14.20)	41.66 (16.96)	41.13 (14.26)	40.14 (15.54)	0.9
MaxE2	333	1,184.05 (699.39)	1,394.17 (671.54)	1,538.73 (530.07)	2,071.82 (896.65)	<0.001
MaxDailyGn	333	353.57 (119.31)	304.37 (122.78)	303.92 (111.85)	280.72 (96.93)	<0.001
TotalGn	333	3,370.40 (1,524.42)	2,746.24 (1,299.38)	2,630.90 (1,237.98)	2,569.15 (1,276.93)	<0.001
Oocytes	333	7.13 (4.26)	9.87 (4.01)	12.49 (4.15)	17.92 (5.05)	<0.001
Embryos	333	2.35 (1.12)	5.02 (0.70)	7.27 (0.99)	12.33 (3.17)	<0.001
¹ Mean (SD)
² One-way ANOVA

final <- final %>%
  mutate(oocytequartile = ntile(Oocytes, 4))


final %>% 
  select(c(Age, LowAFC, MeanAFC, FSH, E2, MaxE2, MaxDailyGn, TotalGn, Oocytes, Embryos, oocytequartile)) %>% 
  tbl_summary(by = oocytequartile,
              missing = "no", 
              statistic = list(
                all_continuous() ~ "{mean} ({sd})",
                all_categorical() ~ "{n} ({p}%)"
              ), 
              digits = all_continuous() ~ 2, 
            
              ) %>%
  add_n() %>%
  add_p(test = list(all_continuous() ~ "aov", all_categorical() ~ "chisq.test"))  %>% bold_p() %>%
  modify_caption("Table 2: Descriptive Statistics of Variables by Oocytes (Quartile)")

Table 2: Descriptive Statistics of Variables by Oocytes (Quartile)
Characteristic	N	1, N = 84¹	2, N = 83¹	3, N = 83¹	4, N = 83¹	p-value²
Age	333	37.07 (4.97)	34.75 (4.45)	34.47 (4.78)	35.02 (4.18)	0.001
LowAFC	333	9.69 (5.79)	11.66 (6.09)	12.31 (5.94)	15.52 (8.35)	<0.001
MeanAFC	333	10.47 (5.93)	12.68 (6.41)	13.66 (6.34)	17.35 (9.02)	<0.001
FSH	333	6.63 (2.42)	6.18 (1.84)	5.68 (1.68)	5.24 (1.43)	<0.001
E2	333	42.38 (14.47)	44.36 (18.02)	39.34 (13.58)	38.89 (14.03)	0.064
MaxE2	333	1,007.74 (492.33)	1,426.60 (534.39)	1,592.64 (656.26)	2,163.92 (898.32)	<0.001
MaxDailyGn	333	362.80 (114.63)	300.30 (119.48)	312.65 (113.57)	266.72 (94.55)	<0.001
TotalGn	333	3,424.55 (1,489.88)	2,702.11 (1,374.35)	2,885.26 (1,262.30)	2,304.11 (1,108.27)	<0.001
Oocytes	333	5.36 (1.62)	9.19 (1.05)	12.86 (1.17)	20.02 (4.17)	<0.001
Embryos	333	3.29 (1.65)	5.23 (1.91)	7.51 (2.82)	10.93 (4.50)	<0.001
¹ Mean (SD)
² One-way ANOVA

Answer:

These tables show the descriptive statistics of all of our variables of interest, stratified by quartiles of Embryos (Table 1) and of Oocytes (Table 2). One-way ANOVA tests whether the means differ between the quartile groups. Interestingly, the group means differ significantly for every variable except E2 (p = 0.9 in Table 1 and p = 0.064 in Table 2). To learn more about these relationships, let’s look at the correlations.
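
The p-values in the tables come from one-way ANOVA across the four quartile groups. As a minimal sketch of that test in base R, the example below uses simulated data; the data frame sim, the strength of the effect, and the rank-based grouping (similar in spirit to dplyr::ntile(Embryos, 4)) are assumptions for illustration.

```r
# Hypothetical stand-in data: a count outcome and one covariate related to it
set.seed(3)
n <- 333
sim <- data.frame(Embryos = rpois(n, lambda = 7))
sim$FSH <- 7 - 0.15 * sim$Embryos + rnorm(n, sd = 1.8)

# Four rank-based groups of near-equal size (ties are broken by row order)
sim$quartile <- ceiling(rank(sim$Embryos, ties.method = "first") * 4 / n)

# One-way ANOVA: does mean FSH differ between the quartile groups?
fit <- aov(FSH ~ factor(quartile), data = sim)
summary(fit)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
```

If the group variances look unequal, oneway.test(FSH ~ factor(quartile), data = sim) runs the Welch version of the same comparison.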

Section 2: Correlation Analysis:

# Correlation analysis between Age, MeanAFC, FSH, E2, TotalGn, Oocytes and Embryos to answer our first question.
# Select the columns of interest
variables <- fertility[, c("Age", "MeanAFC", "FSH", "E2", "TotalGn", "Oocytes", "Embryos")]

# Calculate the correlation matrix
correlation_matrix <- cor(variables)

# Display the correlation matrix
print(correlation_matrix)

##                 Age    MeanAFC         FSH          E2     TotalGn    Oocytes
## Age      1.00000000 -0.2296947  0.27438884 -0.02338577  0.52095315 -0.1131005
## MeanAFC -0.22969466  1.0000000 -0.29637031 -0.12732853 -0.38392056  0.4172390
## FSH      0.27438884 -0.2963703  1.00000000 -0.07135229  0.47317864 -0.2845907
## E2      -0.02338577 -0.1273285 -0.07135229  1.00000000 -0.00719138 -0.1171102
## TotalGn  0.52095315 -0.3839206  0.47317864 -0.00719138  1.00000000 -0.2649491
## Oocytes -0.11310048  0.4172390 -0.28459069 -0.11711017 -0.26494909  1.0000000
## Embryos -0.12781624  0.3464034 -0.22317166 -0.08713926 -0.20824976  0.7580988
##             Embryos
## Age     -0.12781624
## MeanAFC  0.34640339
## FSH     -0.22317166
## E2      -0.08713926
## TotalGn -0.20824976
## Oocytes  0.75809878
## Embryos  1.00000000
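
The matrix reports coefficients only, not whether they are distinguishable from zero. As a worked sketch, a Pearson correlation r from n pairs can be tested with t = r * sqrt(n - 2) / sqrt(1 - r^2) on n - 2 degrees of freedom. The example assumes n = 333 for every pair (the summary output shows no missing values) and copies three coefficients from the matrix above; the labels in the vector are ours.

```r
# Correlations copied from the printed matrix; n is the number of rows
n <- 333
r <- c(Age_Oocytes     = -0.11310048,
       E2_Embryos      = -0.08713926,
       MeanAFC_Oocytes =  0.41723900)

# t statistic and two-sided p-value for H0: the true correlation is zero
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(cbind(r, t_stat, p_value), 4)
```

With the raw data, cor.test(fertility$Age, fertility$Oocytes) performs the same test. Under these assumptions the Age and Oocytes correlation is significant at the 5% level, while the E2 and Embryos correlation is not.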

# Create a heatmap of the correlation matrix
ggplot(data = melt(correlation_matrix), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab", name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1)) +
  coord_fixed()

Answer:

The heatmap of the correlation matrix shows the correlation coefficients between Age, MeanAFC, FSH, E2, TotalGn, Oocytes and Embryos. Among the predictors, the strongest correlations are moderate and positive: age with total gonadotropin (r = 0.52) and FSH with total gonadotropin (r = 0.47), while the correlation between age and FSH is weaker (r = 0.27). There is a negative correlation between total gonadotropin and mean AFC (r = -0.38). Gonadotropin-releasing hormone causes the pituitary gland in the brain to make and secrete luteinizing hormone (LH) and follicle-stimulating hormone (FSH), so a positive correlation between gonadotropin and FSH makes sense. Ovarian function decreases with age, as reflected by the negative correlation between age and AFC (r = -0.23). Reduced ovarian function means less estrogen, which weakens the negative feedback that estrogen exerts on the pituitary gland and leads to an increase in FSH and gonadotropin. This would explain the positive correlation between age and gonadotropin.

There is a strong positive correlation between the number of embryos and the number of oocytes (r = 0.76), and there are moderate positive correlations between MeanAFC and embryos (r = 0.35) and between MeanAFC and oocytes (r = 0.42). In contrast, FSH, age and total gonadotropin are each negatively correlated with both the number of embryos and the number of oocytes.

These correlations are consistent with the following hypotheses:

• Older age is expected to correlate negatively with antral follicle counts, ovarian reserve, and the number of egg cells and embryos produced.

• Higher levels of follicle stimulating hormone (FSH) may be associated with decreased ovarian reserve and fertility levels.

• Antral follicle counts (LowAFC and MeanAFC) are hypothesized to positively correlate with the number of egg cells and embryos.

However, they do not support our final hypothesis:

• Elevated levels of gonadotropins (MaxDailyGn and TotalGn) may indicate increased ovarian stimulation and potentially higher production of egg cells and embryos.

It looks like most explanatory variables in the dataset are correlated with both the number of Embryos and the number of Oocytes, although the correlations with E2 are weak (|r| of about 0.1). At first this makes it seem like they are all predictors of both embryos and oocytes, but how can we make sure?

Time for a multiple linear regression model.

First, let’s take a look at how well the predictors can predict the variable Embryos.

Section 3: Regression Analysis

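# Embryos is the response, so it is not repeated as a predictor (R drops it with a warning; the estimates are unchanged)
model_embryos <- lm(data = final, Embryos ~ Age + MeanAFC + FSH + E2 + TotalGn + Oocytes)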

summary(model_embryos)

## 
## Call:
## lm(formula = Embryos ~ Age + MeanAFC + FSH + E2 + TotalGn + Embryos + 
##     Oocytes, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7048 -1.3105  0.0947  1.3831 10.6921 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.634e+00  1.442e+00   1.133    0.258    
## Age         -4.386e-02  3.673e-02  -1.194    0.233    
## MeanAFC      1.939e-02  2.316e-02   0.837    0.403    
## FSH         -4.725e-04  8.828e-02  -0.005    0.996    
## E2           9.947e-04  9.841e-03   0.101    0.920    
## TotalGn      8.729e-05  1.417e-04   0.616    0.538    
## Oocytes      5.148e-01  2.803e-02  18.370   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.677 on 326 degrees of freedom
## Multiple R-squared:  0.5777, Adjusted R-squared:  0.5699 
## F-statistic: 74.32 on 6 and 326 DF,  p-value: < 2.2e-16

Answer:

Based on what we see here, when we adjust for all of the variables together, the only variable that remains a significant predictor of the number of embryos is the number of oocytes. This makes sense, as an oocyte is always necessary for an embryo.

Because the number of oocytes is the only significant predictor of the number of embryos, we next need to look at which variables can predict the number of oocytes. For this we build the model one step at a time, in the manner of a stepwise regression. As we add more variables we check whether the model quality improves, and in this way we can select the most suitable linear model.

Section 3: Regression Analysis, Linear Models

model_A <- lm(data = final, Oocytes ~Age) 

model_B <- lm(data = final, Oocytes ~ Age + MeanAFC)

model_C <- lm(data = final, Oocytes ~ Age + MeanAFC + FSH)

model_D <- lm(data = final, Oocytes ~ Age + MeanAFC + FSH + E2)

model_E <- lm(data = final, Oocytes ~ Age + MeanAFC + FSH + E2 + TotalGn)

model_F <- lm(data = final, Oocytes ~ Age + MeanAFC + FSH + E2 + TotalGn + Embryos)




# Collect the nested models in a named list for a side-by-side comparison (modelsummary is already loaded)
linearmodels <- list("Age Only" = model_A, "Model B" = model_B, "Model C" = model_C, "Model D" = model_D, "Model E" = model_E, "Full Model" = model_F)

modelsummary(linearmodels, statistic="{conf.low}, {conf.high}", estimate = "{estimate}, p = {p.value}")

	Age Only	Model B	Model C	Model D	Model E	Full Model
(Intercept)	16.867, p = <0.001	8.200, p = 0.001	10.199, p = <0.001	12.008, p = <0.001	11.288, p = <0.001	3.932, p = 0.049
	12.048, 21.686	3.303, 13.097	5.245, 15.152	6.648, 17.369	5.827, 16.749	0.018, 7.846
Age	−0.142, p = 0.039	−0.023, p = 0.723	0.027, p = 0.682	0.023, p = 0.719	0.065, p = 0.367	0.075, p = 0.139
	−0.278, −0.007	−0.150, 0.104	−0.102, 0.155	−0.105, 0.152	−0.077, 0.208	−0.025, 0.175
MeanAFC		0.329, p = <0.001	0.293, p = <0.001	0.282, p = <0.001	0.268, p = <0.001	0.113, p = <0.001
		0.248, 0.409	0.211, 0.375	0.199, 0.364	0.183, 0.353	0.051, 0.175
FSH			−0.552, p = <0.001	−0.581, p = <0.001	−0.501, p = 0.004	−0.246, p = 0.044
			−0.868, −0.235	−0.898, −0.263	−0.840, −0.163	−0.485, −0.007
E2				−0.033, p = 0.088	−0.033, p = 0.087	−0.017, p = 0.205
				−0.071, 0.005	−0.071, 0.005	−0.044, 0.009
TotalGn					0.000, p = 0.188	0.000, p = 0.174
					−0.001, 0.000	−0.001, 0.000
Embryos						0.988, p = <0.001
						0.882, 1.094
Num.Obs.	333	333	333	333	333	333
R2	0.013	0.174	0.203	0.210	0.214	0.614
R2 Adj.	0.010	0.169	0.196	0.200	0.202	0.607
AIC	2129.2	2071.7	2062.0	2061.1	2061.3	1826.7
BIC	2140.7	2086.9	2081.0	2083.9	2087.9	1857.1
Log.Lik.	−1061.616	−1031.850	−1026.004	−1024.527	−1023.642	−905.337
RMSE	5.87	5.36	5.27	5.25	5.23	3.67

Answer:

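The model fit improves as variables are added. R^2 increases at every step, but R^2 can never decrease when a predictor is added, so AIC and BIC are more informative. AIC falls from 2129.2 (Age only) to 2061.1 (Model D), rises marginally for Model E (2061.3), and drops sharply to 1826.7 for the full model, which also has the lowest BIC. The full model is thus the best model by these criteria. However, not every variable is a significant predictor. Age, for example, is a significant predictor on its own (p = 0.039), but as soon as the model adjusts for MeanAFC, it loses its significance (p = 0.723).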

We already established a very strong relationship between the oocytes and the embryos. For this reason, it is no surprise that including Embryos increases the quality of the model more than any other variable.
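
Because R^2 cannot decrease when a predictor is added, nested models are better compared with penalised criteria or a partial F-test. The sketch below illustrates this on simulated data; the data frame sim and the variable Noise, which is unrelated to the outcome by construction, are hypothetical.

```r
# Hypothetical stand-in data
set.seed(4)
n <- 333
sim <- data.frame(Age = rnorm(n, 35, 4.5), MeanAFC = rgamma(n, 3.5, 0.26))
sim$Noise   <- rnorm(n)                                   # unrelated to the outcome
sim$Oocytes <- 8 + 0.3 * sim$MeanAFC + rnorm(n, sd = 5)

m_small <- lm(Oocytes ~ Age + MeanAFC, data = sim)
m_big   <- lm(Oocytes ~ Age + MeanAFC + Noise, data = sim)

# R-squared does not go down, even though Noise carries no information
c(small = summary(m_small)$r.squared, big = summary(m_big)$r.squared)

# Criteria that penalise the extra parameter, and a partial F-test for the added term
AIC(m_small, m_big)
BIC(m_small, m_big)
anova(m_small, m_big)
```

The same calls could be applied to the lab models, for example anova(model_D, model_E) to test whether TotalGn adds anything beyond Model D.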

Diagnostics

# Residuals vs Fitted
plot(model_F, which = 1)

# Q-Q Residuals
plot(model_F, which = 2)

# Scale-Location
plot(model_F, which = 3)

# Cook's distance
plot(model_F, which = 4)

Answer:

Diagnostics:

Linearity: The residuals are spread evenly around a horizontal line without distinct patterns, indicating a linear relationship between the predictor variables and the outcome variable.
Normality of residuals: The residuals follow the straight dashed line well up to a point and then curve upwards, which suggests a heavier upper tail than a normal distribution would have.
Homoscedasticity: A roughly horizontal line with evenly (randomly) spread points shows that the residuals have similar variance across the range of fitted values.
Influential observations: The plot identifies observations 2, 142 and 256 as the most influential. If these observations are excluded, the slope coefficients may change.
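
Whether excluding influential observations actually changes the coefficients can be checked by refitting. The sketch below shows the idea on simulated data with one planted outlier; the data frame sim is hypothetical, and the 4/n cutoff for Cook's distance is a common rule of thumb rather than a strict threshold.

```r
# Hypothetical stand-in data with one planted high-leverage outlier in row 1
set.seed(5)
n <- 333
sim <- data.frame(x = rnorm(n))
sim$y <- 2 + 0.5 * sim$x + rnorm(n)
sim$x[1] <- 6
sim$y[1] <- -8

fit  <- lm(y ~ x, data = sim)
cd   <- cooks.distance(fit)
flag <- which(cd > 4 / n)          # rows exceeding the rule-of-thumb cutoff

# Refit without the flagged rows and compare the coefficients side by side
fit_trim <- lm(y ~ x, data = sim[-flag, ])
rbind(all_rows = coef(fit), trimmed = coef(fit_trim))
```

For the lab model, the analogous check would refit model_F without rows 2, 142 and 256 and compare the two sets of estimates.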

Assumptions

We have the following assumptions in regression analysis:

Linearity: The relationship between the variables is linear.
Independence of errors: Errors are independent and random.
Homoscedasticity: The variance of the errors is constant.
Normality of errors: Errors are normally distributed.
No perfect multicollinearity: No perfect correlation among predictors.
Additivity: Effects of predictors are additive.
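
The multicollinearity assumption can be checked with variance inflation factors (VIF). As a sketch, VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the other predictors. The data frame sim below is simulated and only mimics the positive correlation between Age and TotalGn; its numbers are assumptions.

```r
# Hypothetical stand-in data: TotalGn is constructed to correlate with Age
set.seed(6)
n <- 333
sim <- data.frame(Age = rnorm(n, 35, 4.5))
sim$TotalGn <- 2800 + 150 * (sim$Age - 35) + rnorm(n, sd = 1100)
sim$FSH     <- rnorm(n, 6, 2)

predictors <- c("Age", "TotalGn", "FSH")

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

For a fitted model with only numeric predictors, vif() from the car package loaded at the top of the lab is expected to report the same quantity. Values close to 1 indicate little collinearity; rules of thumb often treat values above 5 or 10 as a concern.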

Section 4: Discussion and Conclusion

The number of oocytes is the only significant predictor of the number of embryos. We can therefore assume that factors which influence the number of oocytes will also influence the number of embryos. Most factors, including Age, MeanAFC, FSH, TotalGn and Embryos, appear to be good predictors by themselves, but when the model adjusts for other variables, the p-values of Age, FSH and E2 increase, so they become less significant. This phenomenon is common in regression analysis and can occur for a variety of reasons, such as:

Confounding: Age might be associated with both the predictor variables (FSH and E2) and the outcome variable. Adjusting for Age helps to control for this confounding effect and gives a clearer view of the relationship between the predictors and the outcome.
Collinearity: Age may be correlated with FSH and E2, leading to multicollinearity in the regression model. In such cases, the individual effects of FSH and E2 may become less distinguishable or less significant when Age is included in the model.
Mediation: Part of the effect of Age on the outcome may run through other variables such as MeanAFC or FSH. Including these mediators in the model can then account for much of the relationship between Age and the outcome, which reduces the significance of Age.

We effectively carried out a stepwise/hierarchical regression, which allowed us to assess the incremental contribution of each predictor to the model’s predictive power. This approach helps to assess whether the predictors of interest explain additional variance in the outcome variable beyond what is already explained by the variables entered earlier, such as potential confounders.

To examine whether the relationship between the predictors and the outcome varies by Age, we can also include interaction effects between Age and the other predictors. The significance of the interaction terms then indicates whether such effect modification is present.

Interaction Effects

model <- lm(data = final, Oocytes ~ Age * FSH + Age * E2 + Age * TotalGn + Age * MeanAFC + Age * Embryos)

# View the summary of the regression model
summary(model)

## 
## Call:
## lm(formula = Oocytes ~ Age * FSH + Age * E2 + Age * TotalGn + 
##     Age * MeanAFC + Age * Embryos, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0473 -2.4122 -0.6693  1.8775 16.8489 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  4.483e+00  9.423e+00   0.476   0.6346  
## Age          5.158e-02  2.684e-01   0.192   0.8477  
## FSH          1.022e+00  1.006e+00   1.016   0.3105  
## E2          -1.039e-01  1.090e-01  -0.953   0.3412  
## TotalGn     -9.608e-05  1.679e-03  -0.057   0.9544  
## MeanAFC      2.975e-01  2.604e-01   1.143   0.2541  
## Embryos     -1.676e-01  4.905e-01  -0.342   0.7328  
## Age:FSH     -3.383e-02  2.708e-02  -1.249   0.2125  
## Age:E2       2.492e-03  3.042e-03   0.819   0.4133  
## Age:TotalGn -5.676e-06  4.515e-05  -0.126   0.9000  
## Age:MeanAFC -5.313e-03  7.307e-03  -0.727   0.4677  
## Age:Embryos  3.300e-02  1.396e-02   2.364   0.0187 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.68 on 321 degrees of freedom
## Multiple R-squared:  0.6255, Adjusted R-squared:  0.6126 
## F-statistic: 48.74 on 11 and 321 DF,  p-value: < 2.2e-16
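
In a model with interactions, each lower-order coefficient describes the slope when the other variable in the interaction equals zero, so the Embryos coefficient above refers to Age = 0, far outside the observed ages. Centring Age is one way to make these terms easier to read. The sketch below demonstrates this on simulated data; the data frame sim and its effect sizes are hypothetical.

```r
# Hypothetical stand-in data with a built-in Age x Embryos interaction
set.seed(7)
n <- 333
sim <- data.frame(Age = rnorm(n, 35, 4.5), Embryos = rpois(n, 7))
sim$Oocytes <- 4 + 0.03 * sim$Age * sim$Embryos + rnorm(n, sd = 3.5)
sim$Age_c   <- sim$Age - mean(sim$Age)      # centred age

raw      <- lm(Oocytes ~ Age * Embryos,   data = sim)
centered <- lm(Oocytes ~ Age_c * Embryos, data = sim)

# The interaction estimate and the fitted values are identical in both models
coef(raw)["Age:Embryos"]
coef(centered)["Age_c:Embryos"]

# Only the meaning of the lower-order terms changes
coef(raw)["Embryos"]        # slope of Embryos at Age = 0
coef(centered)["Embryos"]   # slope of Embryos at the mean age
```

Centring does not change the model fit or the interaction test; it only moves the reference point of the lower-order terms into the range of the data.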

Critique of Methods and Suggestions for Improvement:

Overall, the analysis conducted to investigate the relationships between age, antral follicle count (LowAFC and MeanAFC), follicle-stimulating hormone (FSH), fertility levels (E2 and MaxE2), gonadotropin levels (MaxDailyGn and TotalGn), and the number of egg cells (Oocytes) and embryos (Embryos) produced among women experiencing difficulty in conception is thorough and well-structured. However, there are several areas for improvement and considerations for enhancing the reliability and validity of the analysis:

Interaction Effects: Explore potential interaction effects between predictors (e.g., Age and FSH) to investigate whether the relationship between variables varies based on certain conditions or characteristics.
Data Cleaning and Preprocessing: Ensure rigorous data cleaning procedures to identify and handle missing values, outliers, and potential data entry errors. This helps improve the quality and reliability of the analysis results.
Validity of Findings: Consider the external validity of the findings by ensuring the representativeness of the study sample and generalizability of results to the target population of women experiencing difficulty in conception.
Further Analysis: Explore additional statistical techniques such as logistic regression or survival analysis, to further investigate specific research questions or outcomes related to infertility.

Reflections and Next Steps:

If given the opportunity to start over with the project or continue working on it, several actions can be taken:

Refinement of Research Questions: Clarify and refine the research questions to focus on specific aspects of infertility and reproductive health.
Longitudinal Analysis: Consider collecting longitudinal data to examine changes in fertility-related variables over time and assess their impact on fertility outcomes. Longitudinal analysis provides insights into temporal relationships and allows for the identification of predictive factors.
Collaboration with Domain Experts: Collaborate with fertility specialists and domain experts to gain insights into the clinical relevance of the findings and ensure the applicability of the analysis results in clinical practice.
Mediation analysis: Mediation analysis helps explain how certain predictor variables influence fertility outcomes by identifying intermediate factors that explain this relationship, providing deeper insights into infertility mechanisms.
Subgroup analysis: Subgroup analysis explores how predictor variables relate to fertility outcomes across different groups (e.g., age groups, infertility diagnoses), revealing variations within the study population.

Summary

In summary, the analysis provides valuable insights into the relationships between various factors and fertility outcomes among women experiencing difficulty in conception. By conducting exploratory data analysis, regression analysis, and considering potential confounding factors, the study enhances our understanding of infertility-related factors. While there are areas for improvement and considerations for future research, the findings contribute to the broader field of reproductive health and infertility research, potentially informing clinical practice and interventions to support women experiencing fertility challenges.

Anna Themistokleous, Sonia Kakoulli, Yannick Jung

24 March 2024
