# 1. load the appropriate libraries
library(tidyverse)
library(broom)
library(table1)
library(car)
library(gtsummary)
library(modelsummary)
library(ggplot2)
library(GGally)
library(reshape2)
library(rlang)
library(dplyr)
# 2. use read_csv() to read the dataset
fertility <- read_csv("Data/Fertility.csv")
summary(fertility)
## rownames Age LowAFC MeanAFC FSH
## Min. : 1 Min. :21.00 Min. : 0.00 Min. : 0.00 Min. : 0.500
## 1st Qu.: 84 1st Qu.:32.00 1st Qu.: 7.00 1st Qu.: 8.00 1st Qu.: 4.600
## Median :167 Median :35.00 Median :11.00 Median :12.00 Median : 5.700
## Mean :167 Mean :35.33 Mean :12.29 Mean :13.53 Mean : 5.935
## 3rd Qu.:250 3rd Qu.:39.00 3rd Qu.:15.00 3rd Qu.:17.00 3rd Qu.: 6.900
## Max. :333 Max. :46.00 Max. :41.00 Max. :51.50 Max. :16.000
## E2 MaxE2 MaxDailyGn TotalGn Oocytes
## Min. :13.00 Min. : 290 Min. :100.0 Min. : 825 Min. : 1.00
## 1st Qu.:30.00 1st Qu.: 994 1st Qu.:225.0 1st Qu.:1675 1st Qu.: 7.00
## Median :39.00 Median :1443 Median :300.0 Median :2550 Median :11.00
## Mean :41.25 Mean :1546 Mean :310.8 Mean :2831 Mean :11.84
## 3rd Qu.:52.00 3rd Qu.:1856 3rd Qu.:450.0 3rd Qu.:3962 3rd Qu.:15.00
## Max. :90.00 Max. :6242 Max. :525.0 Max. :7275 Max. :35.00
## Embryos
## Min. : 0.000
## 1st Qu.: 4.000
## Median : 6.000
## Mean : 6.727
## 3rd Qu.: 9.000
## Max. :23.000
glimpse(fertility)
## Rows: 333
## Columns: 11
## $ rownames <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Age <dbl> 40, 37, 40, 40, 30, 29, 31, 33, 36, 35, 25, 39, 35, 30, 37,…
## $ LowAFC <dbl> 40, 41, 38, 36, 36, 35, 24, 28, 30, 32, 27, 32, 31, 18, 29,…
## $ MeanAFC <dbl> 51.5, 41.0, 41.0, 37.5, 36.0, 35.0, 35.0, 34.0, 33.0, 32.0,…
## $ FSH <dbl> 5.3, 7.1, 4.9, 3.9, 4.0, 3.9, 3.8, 4.3, 4.9, 3.7, 5.0, 5.3,…
## $ E2 <dbl> 45, 53, 40, 26, 49, 67, 49, 20, 60, 36, 20, 37, 30, 33, 40,…
## $ MaxE2 <dbl> 1427, 802, 4533, 1804, 2526, 3812, 1087, 1615, 1879, 2009, …
## $ MaxDailyGn <dbl> 300.0, 225.0, 450.0, 300.0, 150.0, 150.0, 262.5, 375.0, 300…
## $ TotalGn <dbl> 2700.0, 1800.0, 4850.0, 2700.0, 1500.0, 975.0, 2512.5, 3075…
## $ Oocytes <dbl> 25, 7, 27, 9, 19, 19, 13, 15, 23, 26, 22, 22, 7, 27, 12, 11…
## $ Embryos <dbl> 13, 6, 15, 4, 12, 16, 9, 9, 10, 8, 13, 18, 5, 18, 9, 2, 8, …
Answer:
Background:
The dataset comprises fertility measurements collected from a large sample of women experiencing difficulty conceiving. The full data set can be found here. These measurements were obtained by a medical team led by Dr. Priya Maseelall, who investigated various factors related to infertility among these women.
Research Questions:
• How do age, antral follicle count (LowAFC and MeanAFC), follicle stimulating hormone level (FSH), fertility levels (E2 and MaxE2), and gonadotropin levels (MaxDailyGn and TotalGn) relate to each other among women experiencing difficulty in getting pregnant?
• Are there any significant correlations or associations between these variables and the number of egg cells (Oocytes) or embryos (Embryos) produced?
• Does age have a significant impact on a woman’s fertility level, with older women exhibiting lower fertility levels compared to younger women?
• Is there a negative correlation between a woman’s age and the number of egg cells she produces, suggesting that older women tend to have fewer egg cells compared to younger women?
• What is the association between the average antral follicle count and the number of egg cells produced by a woman, indicating whether higher antral follicle counts are related to a greater number of egg cells?
Motivation:
Understanding the factors influencing fertility among women facing conception challenges is crucial for improving reproductive health outcomes. By further exploring the relationships between age, hormonal levels, antral follicle counts, and other variables, clinicians and researchers can better tailor fertility treatments and interventions. Previous studies have already highlighted the importance of antral follicle counts and hormonal levels in predicting ovarian reserve and fertility outcomes among women undergoing assisted reproductive technologies like IVF.
Hypotheses:
• Older age is expected to correlate negatively with antral follicle counts, ovarian reserve, and the number of egg cells and embryos produced.
• Higher levels of follicle stimulating hormone (FSH) may be associated with decreased ovarian reserve and fertility levels.
• Antral follicle counts (LowAFC and MeanAFC) are hypothesized to positively correlate with the number of egg cells and embryos.
• Elevated levels of gonadotropins (MaxDailyGn and TotalGn) may indicate increased ovarian stimulation and potentially higher production of egg cells and embryos.
Data Description:
• Number of observations: 333
• Number of variables: 10 (plus a rownames index column, which gives the 11 columns reported by glimpse())
• Variables:
Age: Age of the women in years.
LowAFC: Smallest antral follicle count.
MeanAFC: Average antral follicle count.
FSH: Maximum follicle stimulating hormone level.
E2: Fertility level.
MaxE2: Maximum fertility level.
MaxDailyGn: Maximum daily gonadotropin level.
TotalGn: Total gonadotropin level.
Oocytes: Number of egg cells.
Embryos: Number of embryos.
These variables provide insights into various aspects of fertility, including age, ovarian reserve (antral follicle counts), hormonal levels, and outcomes such as the number of egg cells and embryos produced. Further analysis of this dataset can help us understand the complex connections between these factors in women experiencing infertility.
# table1() comes from the table1 package, which is already loaded above
# Divide E2 into 5 equal-width bins (these are not true quantiles, so group sizes differ)
fertility$E2_quantiles <- cut(fertility$E2, breaks = 5, labels = FALSE, include.lowest = TRUE)
# Create the summary table
table1(~ Age + MeanAFC + FSH + TotalGn + Oocytes + Embryos | factor(E2_quantiles), data = fertility, overall = FALSE)
 | 1 (N=69) | 2 (N=125) | 3 (N=93) | 4 (N=36) | 5 (N=10) |
---|---|---|---|---|---|
Age | |||||
Mean (SD) | 36.6 (4.32) | 34.7 (5.02) | 35.1 (4.27) | 36.1 (5.21) | 34.5 (3.57) |
Median [Min, Max] | 37.0 [25.0, 44.0] | 35.0 [21.0, 45.0] | 35.0 [24.0, 44.0] | 35.0 [27.0, 46.0] | 35.0 [28.0, 40.0] |
MeanAFC | |||||
Mean (SD) | 15.2 (7.18) | 13.5 (7.17) | 13.2 (8.29) | 12.3 (6.81) | 10.4 (3.79) |
Median [Min, Max] | 15.0 [2.00, 37.5] | 12.0 [0, 41.0] | 12.0 [0, 51.5] | 11.3 [2.00, 35.0] | 9.25 [7.00, 18.0] |
FSH | |||||
Mean (SD) | 6.39 (2.34) | 5.84 (1.83) | 5.73 (1.82) | 5.99 (1.89) | 5.66 (1.34) |
Median [Min, Max] | 6.30 [2.30, 16.0] | 5.70 [1.90, 11.8] | 5.60 [0.500, 11.4] | 5.55 [3.30, 10.5] | 5.95 [3.20, 7.00] |
TotalGn | |||||
Mean (SD) | 3110 (1570) | 2650 (1280) | 2740 (1340) | 3180 (1390) | 2840 (940) |
Median [Min, Max] | 2850 [825, 7280] | 2180 [1010, 5850] | 2360 [925, 5850] | 3490 [975, 5400] | 2780 [1580, 4500] |
Oocytes | |||||
Mean (SD) | 12.8 (5.64) | 12.1 (6.15) | 11.4 (5.94) | 10.0 (5.41) | 11.8 (5.57) |
Median [Min, Max] | 12.0 [2.00, 30.0] | 12.0 [1.00, 28.0] | 10.0 [2.00, 35.0] | 9.50 [3.00, 29.0] | 10.5 [4.00, 23.0] |
Embryos | |||||
Mean (SD) | 7.58 (3.86) | 6.61 (3.90) | 6.47 (4.49) | 5.97 (4.07) | 7.40 (3.53) |
Median [Min, Max] | 7.00 [2.00, 21.0] | 6.00 [0, 20.0] | 6.00 [0, 23.0] | 5.50 [0, 16.0] | 6.00 [3.00, 13.0] |
# Create a histogram of the E2 bins (counts per bin)
ggplot(fertility, aes(x = E2_quantiles)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Fertility Level (E2) Bins", x = "E2 bin (1 = lowest, 5 = highest)", y = "Frequency")
# Create a boxplot for E2 (a horizontal boxplot has no frequency axis, so the y label is dropped)
ggplot(fertility, aes(x = E2)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Fertility Level (E2)", x = "Fertility Level (E2)", y = NULL)
# Boxplots of E2 by bin (the bin index is numeric, so convert it to a factor to get one box per bin)
ggplot(fertility, aes(x = factor(E2_quantiles), y = E2, fill = factor(E2_quantiles))) +
  geom_boxplot() +
  labs(title = "Boxplots of Fertility (E2) by Bin",
       x = "E2 bin",
       y = "Fertility (E2)",
       fill = "E2 bin") +
  theme_minimal() +
  facet_wrap(~E2_quantiles, scales = "free")
Answer:
The histogram shows that the most frequent fertility levels fall in the second of the five E2 bins, followed by the third.
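Because `cut(x, breaks = 5)` splits the range of E2 into equal-width intervals, the five groups above are not quintiles, which is why their sizes differ so much. A minimal sketch of the difference follows. It uses simulated values as a stand-in for the E2 column (the numbers are hypothetical, not the real data), and the names `e2`, `equal_width` and `quintile` are illustrative only.

```r
# Simulated stand-in for the E2 column (hypothetical values, not the real data)
set.seed(1)
e2 <- rgamma(333, shape = 8, scale = 5)

# Equal-width bins: a single number for `breaks` splits the range into 5 equal intervals
equal_width <- cut(e2, breaks = 5, labels = FALSE, include.lowest = TRUE)

# Quintiles: break points at the 0th, 20th, ..., 100th percentiles
q_breaks <- quantile(e2, probs = seq(0, 1, by = 0.2))
quintile <- cut(e2, breaks = q_breaks, labels = FALSE, include.lowest = TRUE)

table(equal_width)  # group sizes can differ widely
table(quintile)     # group sizes are (nearly) equal
```

If equally sized groups are wanted for the summary table, the quantile-based version is the one to use. With heavily tied data the quantile break points may coincide, in which case `cut()` stops with an error and the breaks need to be made unique first.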
## [1] 333 11
Answer:
Histogram of Age: The majority of women are between the ages of 30 and 40. Histograms of LowAFC and MeanAFC: Both distributions are skewed to the right, with most counts at lower AFC values. Histogram of MaxE2: The distribution is skewed to the right, with most counts at lower MaxE2 values, i.e. lower maximum fertility. Overall, LowAFC, MeanAFC, MaxE2 and Oocytes show similar right-skewed distributions.
# Create a new variable indicating quartiles of embryo numbers
# (`final` is assumed to be the analysis copy of `fertility` created in a step not shown here)
final <- final %>%
mutate(embryoquartile = ntile(Embryos, 4))
final %>%
select(c(Age, LowAFC, MeanAFC, FSH, E2, MaxE2, MaxDailyGn, TotalGn, Oocytes, Embryos, embryoquartile)) %>%
tbl_summary(by = embryoquartile,
missing = "no",
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 2,
) %>%
add_n() %>%
add_p(test = list(all_continuous() ~ "aov", all_categorical() ~ "chisq.test")) %>% bold_p() %>%
modify_caption("Table 1: Descriptive Statistics of Variables by Embryos (Quartile)")
Characteristic | N | 1, N = 84¹ | 2, N = 83¹ | 3, N = 83¹ | 4, N = 83¹ | p-value² |
---|---|---|---|---|---|---|
Age | 333 | 36.70 (4.77) | 34.93 (5.12) | 34.94 (4.47) | 34.75 (4.19) | 0.021 |
LowAFC | 333 | 10.18 (5.51) | 11.93 (7.16) | 12.00 (5.91) | 15.07 (8.02) | <0.001 |
MeanAFC | 333 | 11.35 (6.11) | 12.89 (6.99) | 13.51 (6.67) | 16.40 (8.86) | <0.001 |
FSH | 333 | 6.34 (2.09) | 6.27 (2.07) | 5.85 (1.94) | 5.29 (1.45) | 0.001 |
E2 | 333 | 42.04 (14.20) | 41.66 (16.96) | 41.13 (14.26) | 40.14 (15.54) | 0.9 |
MaxE2 | 333 | 1,184.05 (699.39) | 1,394.17 (671.54) | 1,538.73 (530.07) | 2,071.82 (896.65) | <0.001 |
MaxDailyGn | 333 | 353.57 (119.31) | 304.37 (122.78) | 303.92 (111.85) | 280.72 (96.93) | <0.001 |
TotalGn | 333 | 3,370.40 (1,524.42) | 2,746.24 (1,299.38) | 2,630.90 (1,237.98) | 2,569.15 (1,276.93) | <0.001 |
Oocytes | 333 | 7.13 (4.26) | 9.87 (4.01) | 12.49 (4.15) | 17.92 (5.05) | <0.001 |
Embryos | 333 | 2.35 (1.12) | 5.02 (0.70) | 7.27 (0.99) | 12.33 (3.17) | <0.001 |
¹ Mean (SD) | ||||||
² One-way ANOVA |
final <- final %>%
mutate(oocytequartile = ntile(Oocytes, 4))
final %>%
select(c(Age, LowAFC, MeanAFC, FSH, E2, MaxE2, MaxDailyGn, TotalGn, Oocytes, Embryos, oocytequartile)) %>%
tbl_summary(by = oocytequartile,
missing = "no",
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 2,
) %>%
add_n() %>%
add_p(test = list(all_continuous() ~ "aov", all_categorical() ~ "chisq.test")) %>% bold_p() %>%
modify_caption("Table 2: Descriptive Statistics of Variables by Oocytes (Quartile)")
Characteristic | N | 1, N = 84¹ | 2, N = 83¹ | 3, N = 83¹ | 4, N = 83¹ | p-value² |
---|---|---|---|---|---|---|
Age | 333 | 37.07 (4.97) | 34.75 (4.45) | 34.47 (4.78) | 35.02 (4.18) | 0.001 |
LowAFC | 333 | 9.69 (5.79) | 11.66 (6.09) | 12.31 (5.94) | 15.52 (8.35) | <0.001 |
MeanAFC | 333 | 10.47 (5.93) | 12.68 (6.41) | 13.66 (6.34) | 17.35 (9.02) | <0.001 |
FSH | 333 | 6.63 (2.42) | 6.18 (1.84) | 5.68 (1.68) | 5.24 (1.43) | <0.001 |
E2 | 333 | 42.38 (14.47) | 44.36 (18.02) | 39.34 (13.58) | 38.89 (14.03) | 0.064 |
MaxE2 | 333 | 1,007.74 (492.33) | 1,426.60 (534.39) | 1,592.64 (656.26) | 2,163.92 (898.32) | <0.001 |
MaxDailyGn | 333 | 362.80 (114.63) | 300.30 (119.48) | 312.65 (113.57) | 266.72 (94.55) | <0.001 |
TotalGn | 333 | 3,424.55 (1,489.88) | 2,702.11 (1,374.35) | 2,885.26 (1,262.30) | 2,304.11 (1,108.27) | <0.001 |
Oocytes | 333 | 5.36 (1.62) | 9.19 (1.05) | 12.86 (1.17) | 20.02 (4.17) | <0.001 |
Embryos | 333 | 3.29 (1.65) | 5.23 (1.91) | 7.51 (2.82) | 10.93 (4.50) | <0.001 |
¹ Mean (SD) | ||||||
² One-way ANOVA |
Answer:
These tables show the descriptive statistics of all of our variables of interest, stratified by quartiles of embryos and of oocytes. One-way ANOVA tests are used to test for a difference in the group means. Interestingly, the group means differ significantly for every variable except E2 (p = 0.9 by embryo quartile, p = 0.064 by oocyte quartile). Note that the difference for the stratifying variable itself (Embryos in Table 1, Oocytes in Table 2) is expected by construction. To learn more about how these relationships look, let’s look at their correlations.
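To make the ANOVA step concrete, here is a minimal base-R sketch of a one-way ANOVA of a predictor across quartile groups of an outcome. It runs on simulated data, so the variable names (`oocytes`, `afc`, `grp`) and the relationship between them are assumptions for illustration, not the study data. The quartile grouping mimics what `ntile(x, 4)` does by splitting the ranks into four equally sized groups.

```r
set.seed(2)
n <- 200
oocytes <- rpois(n, lambda = 12)
# Hypothetical predictor that rises with the outcome
afc <- 5 + 0.6 * oocytes + rnorm(n, sd = 4)

# Quartile groups of the outcome: split the ranks into 4 equally sized groups
grp <- cut(rank(oocytes, ties.method = "first"), breaks = 4, labels = 1:4)

# One-way ANOVA: do the mean AFC values differ between the quartile groups?
fit <- aov(afc ~ grp)
summary(fit)

# Extract the p-value of the group effect from the ANOVA table
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```

A small p-value only says that at least one group mean differs. It does not say which groups differ or in which direction, so the group means in the table still need to be read alongside it.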
# Correlation analysis between the variables Age, MeanAFC, FSH, E2, TotalGn, Oocytes and Embryos to answer our first question.
# Select the columns of interest
variables <- fertility[, c("Age", "MeanAFC", "FSH", "E2", "TotalGn", 'Oocytes','Embryos')]
# Calculate the correlation matrix
correlation_matrix <- cor(variables)
# Display the correlation matrix
print(correlation_matrix)
## Age MeanAFC FSH E2 TotalGn Oocytes
## Age 1.00000000 -0.2296947 0.27438884 -0.02338577 0.52095315 -0.1131005
## MeanAFC -0.22969466 1.0000000 -0.29637031 -0.12732853 -0.38392056 0.4172390
## FSH 0.27438884 -0.2963703 1.00000000 -0.07135229 0.47317864 -0.2845907
## E2 -0.02338577 -0.1273285 -0.07135229 1.00000000 -0.00719138 -0.1171102
## TotalGn 0.52095315 -0.3839206 0.47317864 -0.00719138 1.00000000 -0.2649491
## Oocytes -0.11310048 0.4172390 -0.28459069 -0.11711017 -0.26494909 1.0000000
## Embryos -0.12781624 0.3464034 -0.22317166 -0.08713926 -0.20824976 0.7580988
## Embryos
## Age -0.12781624
## MeanAFC 0.34640339
## FSH -0.22317166
## E2 -0.08713926
## TotalGn -0.20824976
## Oocytes 0.75809878
## Embryos 1.00000000
# Create a heatmap of the correlation matrix
ggplot(data = melt(correlation_matrix), aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 2)), color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
limit = c(-1, 1), space = "Lab", name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1)) +
coord_fixed()
Answer:
The heatmap of the correlation matrix shows the correlation coefficients between Age, MeanAFC, FSH, E2, TotalGn, Oocytes and Embryos. Age and total gonadotropin are moderately positively correlated (r = 0.52), as are FSH and total gonadotropin (r = 0.47), while the correlation between age and FSH is weaker (r = 0.27). Total gonadotropin and mean AFC are negatively correlated (r = -0.38). Gonadotropin-releasing hormone causes the pituitary gland in the brain to make and secrete the gonadotropins luteinizing hormone (LH) and follicle-stimulating hormone (FSH), hence a positive correlation between gonadotropin and FSH makes sense. Ovarian function decreases with age, as reflected in the negative correlation between age and AFC (r = -0.23). Reduced ovarian function means less estrogen, which weakens the negative feedback that estrogen exerts on the pituitary gland and leads to an increase in FSH and gonadotropin. This is consistent with the positive correlation between age and gonadotropin.
There is a strong positive correlation between the number of embryos and oocytes (r = 0.76), and moderate positive correlations between MeanAFC and embryos (r = 0.35) and between MeanAFC and oocytes (r = 0.42). In contrast, both embryos and oocytes are negatively correlated with FSH, age and total gonadotropin.
These correlations are consistent with the following hypotheses:
• Older age is expected to correlate negatively with antral follicle counts, ovarian reserve, and the number of egg cells and embryos produced.
• Higher levels of follicle stimulating hormone (FSH) may be associated with decreased ovarian reserve and fertility levels.
• Antral follicle counts (LowAFC and MeanAFC) are hypothesized to positively correlate with the number of egg cells and embryos.
However, they contradict our final hypothesis:
• Elevated levels of gonadotropins (MaxDailyGn and TotalGn) may indicate increased ovarian stimulation and potentially higher production of egg cells and embryos.
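The correlation matrix gives only point estimates. Whether a given correlation differs from zero can be checked with `cor.test()`, which returns the estimate together with a confidence interval and a p-value. The sketch below uses simulated data with a weak negative relationship. The names `age` and `oocytes` and the numbers used to generate them are assumptions for illustration, not the study data.

```r
set.seed(3)
n <- 333
age <- rnorm(n, mean = 35, sd = 4.5)
# Hypothetical outcome with a weak negative dependence on age
oocytes <- 17 - 0.15 * age + rnorm(n, sd = 6)

# Pearson correlation test of H0: the true correlation is zero
ct <- cor.test(age, oocytes, method = "pearson")

ct$estimate  # sample correlation r
ct$conf.int  # 95% confidence interval for r
ct$p.value   # two-sided p-value
```

With several variable pairs tested at once, the p-values would ideally be adjusted for multiple comparisons, for example with `p.adjust()`.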
Nearly every explanatory variable in the dataset is correlated with both the number of embryos and the number of oocytes, although the correlations with E2 are weak. At first this makes it seem like they are all predictors of both embryos and oocytes, but how can we make sure?
Time for a multiple linear regression model.
First, let’s take a look at how well the predictors can predict the variable Embryos.
# The response (Embryos) must not also appear on the right-hand side; R would drop it with a warning
model_embryos <- lm(data = final, Embryos ~ Age + MeanAFC + FSH + E2 + TotalGn + Oocytes)
summary(model_embryos)
##
## Call:
## lm(formula = Embryos ~ Age + MeanAFC + FSH + E2 + TotalGn + Oocytes,
## data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7048 -1.3105 0.0947 1.3831 10.6921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.634e+00 1.442e+00 1.133 0.258
## Age -4.386e-02 3.673e-02 -1.194 0.233
## MeanAFC 1.939e-02 2.316e-02 0.837 0.403
## FSH -4.725e-04 8.828e-02 -0.005 0.996
## E2 9.947e-04 9.841e-03 0.101 0.920
## TotalGn 8.729e-05 1.417e-04 0.616 0.538
## Oocytes 5.148e-01 2.803e-02 18.370 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.677 on 326 degrees of freedom
## Multiple R-squared: 0.5777, Adjusted R-squared: 0.5699
## F-statistic: 74.32 on 6 and 326 DF, p-value: < 2.2e-16
Answer:
Based on what we see here, when we adjust for all of the variables together, the only variable that remains a significant predictor of the number of embryos is the number of oocytes. This makes sense, as an oocyte is always necessary for an embryo.
Because oocytes are the only significant predictor of the number of embryos, we next look at which variables predict the number of oocytes. For this we use a hierarchical (sequential) regression: we add one variable at a time and check whether the model quality improves, and in this way select the most suitable linear model.
model_A <- lm(data = final, Oocytes ~Age)
model_B <- lm(data = final, Oocytes ~ Age + MeanAFC)
model_C <- lm(data = final, Oocytes ~ Age + MeanAFC + FSH)
model_D <- lm(data = final, Oocytes ~ Age + MeanAFC + FSH + E2)
model_E <- lm(data = final, Oocytes ~ Age + MeanAFC + FSH + E2 + TotalGn)
model_F <- lm(data = final, Oocytes ~ Age + MeanAFC + FSH + E2 + TotalGn + Embryos)
library(modelsummary)
linearmodels = list("Age Only" = model_A, "Model B" = model_B,"Model C" = model_C, "Model D" = model_D, "Model E" = model_E, "Full Model" = model_F)
modelsummary(linearmodels, statistic="{conf.low}, {conf.high}", estimate = "{estimate}, p = {p.value}")
 | Age Only | Model B | Model C | Model D | Model E | Full Model |
---|---|---|---|---|---|---|
(Intercept) | 16.867, p = <0.001 | 8.200, p = 0.001 | 10.199, p = <0.001 | 12.008, p = <0.001 | 11.288, p = <0.001 | 3.932, p = 0.049 |
 | 12.048, 21.686 | 3.303, 13.097 | 5.245, 15.152 | 6.648, 17.369 | 5.827, 16.749 | 0.018, 7.846 |
Age | −0.142, p = 0.039 | −0.023, p = 0.723 | 0.027, p = 0.682 | 0.023, p = 0.719 | 0.065, p = 0.367 | 0.075, p = 0.139 |
 | −0.278, −0.007 | −0.150, 0.104 | −0.102, 0.155 | −0.105, 0.152 | −0.077, 0.208 | −0.025, 0.175 |
MeanAFC | | 0.329, p = <0.001 | 0.293, p = <0.001 | 0.282, p = <0.001 | 0.268, p = <0.001 | 0.113, p = <0.001 |
 | | 0.248, 0.409 | 0.211, 0.375 | 0.199, 0.364 | 0.183, 0.353 | 0.051, 0.175 |
FSH | | | −0.552, p = <0.001 | −0.581, p = <0.001 | −0.501, p = 0.004 | −0.246, p = 0.044 |
 | | | −0.868, −0.235 | −0.898, −0.263 | −0.840, −0.163 | −0.485, −0.007 |
E2 | | | | −0.033, p = 0.088 | −0.033, p = 0.087 | −0.017, p = 0.205 |
 | | | | −0.071, 0.005 | −0.071, 0.005 | −0.044, 0.009 |
TotalGn | | | | | 0.000, p = 0.188 | 0.000, p = 0.174 |
 | | | | | −0.001, 0.000 | −0.001, 0.000 |
Embryos | | | | | | 0.988, p = <0.001 |
 | | | | | | 0.882, 1.094 |
Num.Obs. | 333 | 333 | 333 | 333 | 333 | 333 |
R2 | 0.013 | 0.174 | 0.203 | 0.210 | 0.214 | 0.614 |
R2 Adj. | 0.010 | 0.169 | 0.196 | 0.200 | 0.202 | 0.607 |
AIC | 2129.2 | 2071.7 | 2062.0 | 2061.1 | 2061.3 | 1826.7 |
BIC | 2140.7 | 2086.9 | 2081.0 | 2083.9 | 2087.9 | 1857.1 |
Log.Lik. | −1061.616 | −1031.850 | −1026.004 | −1024.527 | −1023.642 | −905.337 |
RMSE | 5.87 | 5.36 | 5.27 | 5.25 | 5.23 | 3.67 |
Answer:
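R² increases with every additional variable, as it must for nested models. AIC decreases up to Model D (2061.1) but rises slightly for Model E (2061.3), and among the models without Embryos BIC is lowest for Model C (2081.0), so adding E2 and TotalGn brings little improvement. The full model has by far the lowest AIC and BIC and is therefore the best-fitting model. However, not every variable is a significant predictor. Age, for example, is a significant predictor on its own (p = 0.039), but as soon as it is adjusted for another variable it loses its significance.
We already established a very strong relationship between the oocytes and the embryos. For this reason, it may not come as a surprise that including the embryos in the model improves its quality more than any other variable. Note, however, that embryos are derived from oocytes, so Embryos is an outcome of Oocytes rather than a true predictor of it.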
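The comparison of nested models can also be made formally. The sketch below fits three nested models on simulated data and compares them by R², AIC/BIC and partial F-tests via `anova()`. The data-generating process and the names (`age`, `afc`, `noise`, `oocytes`) are assumptions chosen for illustration, with `noise` deliberately unrelated to the outcome.

```r
set.seed(4)
n <- 333
age <- rnorm(n, 35, 4.5)
afc <- 30 - 0.4 * age + rnorm(n, sd = 6)   # hypothetical: AFC declines with age
noise <- rnorm(n)                          # predictor unrelated to the outcome
oocytes <- 8 + 0.3 * afc + rnorm(n, sd = 5)
d <- data.frame(age, afc, noise, oocytes)

m1 <- lm(oocytes ~ age, data = d)
m2 <- lm(oocytes ~ age + afc, data = d)
m3 <- lm(oocytes ~ age + afc + noise, data = d)

# R-squared never decreases when terms are added, so it cannot select a model on its own
r2 <- sapply(list(m1, m2, m3), function(m) summary(m)$r.squared)
r2

# AIC and BIC penalise extra parameters; lower values are better
ic <- data.frame(model = c("m1", "m2", "m3"),
                 AIC = AIC(m1, m2, m3)$AIC,
                 BIC = BIC(m1, m2, m3)$BIC)
ic

# Partial F-tests between successive nested models
anova(m1, m2, m3)
```

Reading the three criteria together guards against the pitfall of treating any increase in R² as an improvement: a term is worth keeping only if the information criteria drop or the partial F-test supports it.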
#Residual Vs Fitted
plot(model_F, which = 1)
#QQ Residuals
plot(model_F, which=2)
##Scale Location
plot(model_F, which = 3)
##Cook’s distance
plot(model_F, which=4)
Answer:
Diagnostics:
Linearity: The residuals are spread evenly around a horizontal line without distinct patterns, indicating a linear relationship between the predictor variables and the outcome variable.
Normality of residuals: The residuals follow the straight dashed line well over most of the range but curve upwards at the upper end, which points to a heavier right tail than a normal distribution would have.
Homoscedasticity: A horizontal line with evenly (randomly) spread points shows that the residual variance is roughly constant across the range of fitted values.
Influential observations: The plot flags observations 2, 142 and 256 as the most influential. If these observations are excluded, the slope coefficients may change.
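The visual checks above can be backed up with numbers. The following sketch, on simulated data, runs a Shapiro-Wilk test on the residuals, flags observations by Cook's distance, and refits the model without them to see how much the coefficients move. The 4/n cut-off is a common rule of thumb and an assumption here, not something the analysis above used, and the variables `x` and `y` are hypothetical.

```r
set.seed(5)
n <- 333
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)
fit <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality)
sw <- shapiro.test(residuals(fit))
sw$p.value

# Influence: flag observations whose Cook's distance exceeds the 4/n rule of thumb
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)

# Sensitivity check: refit without the flagged rows and compare the coefficients
keep <- setdiff(seq_len(n), flagged)
fit_trim <- lm(y ~ x, subset = keep)
rbind(full = coef(fit), trimmed = coef(fit_trim))
```

If the trimmed coefficients stay close to the full ones, the flagged observations are not driving the conclusions and can reasonably be kept in the model.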
Assumptions
We have the following assumptions in regression analysis:
Linearity: Relationship between variables is linear.
Independence of Errors: Errors are independent and random.
Homoscedasticity: Constant variance of errors.
Normality of Errors: Errors are normally distributed.
No Perfect Multicollinearity: No perfect correlation among predictors.
Additivity: Effects of predictors are additive.
Oocytes are the only significant predictor of the number of embryos. We can therefore assume that factors which influence the number of oocytes will also influence the number of embryos. Age is a significant predictor of oocytes on its own, but loses significance as soon as MeanAFC is added to the model. MeanAFC and FSH remain significant in every model that includes them, although their coefficients shrink when Embryos is added, while E2 and TotalGn never reach significance. A predictor losing significance after adjustment is common in regression analysis and can occur for a variety of reasons, such as:
Confounding: Another variable (for example MeanAFC) might be associated with both Age and the outcome variable. Adjusting for it controls for this confounding effect and reveals the relationship between Age and the outcome that remains.
Collinearity: Age is correlated with MeanAFC, FSH and TotalGn, leading to multicollinearity in the regression model. In such cases, the individual effects of correlated predictors become harder to distinguish and may lose significance when they are included together.
Mediation: MeanAFC and FSH could be mediators in the relationship between Age and the outcome variable. Including them in the model then explains part of the effect of Age, leading to reduced significance of Age.
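How much collinearity is present can be quantified with variance inflation factors (VIFs). The sketch below computes them by hand in base R from the definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The data are simulated and the names (`age`, `fsh`, `total_gn`) and coefficients are assumptions for illustration. For a fitted `lm` object, `car::vif()` reports the same quantity.

```r
set.seed(6)
n <- 333
age <- rnorm(n, 35, 4.5)
fsh <- 2 + 0.1 * age + rnorm(n, sd = 1.8)            # hypothetical: FSH rises with age
total_gn <- 100 * age - 700 + rnorm(n, sd = 1200)    # hypothetical: dose rises with age
X <- data.frame(age, fsh, total_gn)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on all other predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
vif_manual
```

A VIF of 1 means a predictor is uncorrelated with the others. Values above roughly 5 to 10 are often taken as a sign that collinearity is inflating the standard errors, although such thresholds are conventions rather than strict rules.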
We effectively carried out a hierarchical (sequential) regression, allowing us to assess the incremental contribution of each predictor to the model’s predictive power. This approach helps to assess whether the predictors of interest explain additional variance in the outcome variable beyond what is already explained by the variables entered earlier.
To assess whether the relationship between the predictors and the outcome varies by Age (effect modification), we can also include interaction effects between Age and the other predictors and examine the significance of the interaction terms.
model <- lm(data = final, Oocytes ~ Age * FSH + Age * E2 + Age * TotalGn + Age * MeanAFC + Age * Embryos)
# View the summary of the regression model
summary(model)
##
## Call:
## lm(formula = Oocytes ~ Age * FSH + Age * E2 + Age * TotalGn +
## Age * MeanAFC + Age * Embryos, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0473 -2.4122 -0.6693 1.8775 16.8489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.483e+00 9.423e+00 0.476 0.6346
## Age 5.158e-02 2.684e-01 0.192 0.8477
## FSH 1.022e+00 1.006e+00 1.016 0.3105
## E2 -1.039e-01 1.090e-01 -0.953 0.3412
## TotalGn -9.608e-05 1.679e-03 -0.057 0.9544
## MeanAFC 2.975e-01 2.604e-01 1.143 0.2541
## Embryos -1.676e-01 4.905e-01 -0.342 0.7328
## Age:FSH -3.383e-02 2.708e-02 -1.249 0.2125
## Age:E2 2.492e-03 3.042e-03 0.819 0.4133
## Age:TotalGn -5.676e-06 4.515e-05 -0.126 0.9000
## Age:MeanAFC -5.313e-03 7.307e-03 -0.727 0.4677
## Age:Embryos 3.300e-02 1.396e-02 2.364 0.0187 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.68 on 321 degrees of freedom
## Multiple R-squared: 0.6255, Adjusted R-squared: 0.6126
## F-statistic: 48.74 on 11 and 321 DF, p-value: < 2.2e-16
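In a model with interaction terms, each main-effect coefficient describes the effect of that predictor when the interacting variable equals zero. In the output above that means Age = 0, which lies far outside the data and helps explain why the main effects look non-significant. Centring Age and comparing the additive and interaction models with a partial F-test makes the result easier to read. The sketch below illustrates this on simulated data. The variable names and the size of the interaction are assumptions, not estimates from the study.

```r
set.seed(7)
n <- 333
age <- rnorm(n, 35, 4.5)
embryos <- rpois(n, 7)
# Hypothetical data-generating process with a small age-by-embryos interaction
oocytes <- 4 + 0.05 * age + 0.2 * embryos + 0.02 * age * embryos + rnorm(n, sd = 3.5)
d <- data.frame(age, embryos, oocytes)

# Centre age so that main effects are interpreted at the mean age rather than at age 0
d$age_c <- d$age - mean(d$age)

m_add <- lm(oocytes ~ age_c + embryos, data = d)   # additive model
m_int <- lm(oocytes ~ age_c * embryos, data = d)   # adds the age_c:embryos interaction

# Partial F-test: does the interaction term improve on the additive model?
cmp <- anova(m_add, m_int)
cmp
```

Centring does not change the interaction coefficient or the overall fit. It only shifts the reference point at which the main effects are evaluated.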
Critique of Methods and Suggestions for Improvement:
Overall, the analysis conducted to investigate the relationships between age, antral follicle count (LowAFC and MeanAFC), follicle-stimulating hormone (FSH), fertility levels (E2 and MaxE2), gonadotropin levels (MaxDailyGn and TotalGn), and the number of egg cells (Oocytes) and embryos (Embryos) produced among women experiencing difficulty in conception is thorough and well-structured. However, there are several areas for improvement and considerations for enhancing the reliability and validity of the analysis:
Interaction Effects: Explore potential interaction effects between predictors (e.g., Age and FSH) to investigate whether the relationship between variables varies based on certain conditions or characteristics.
Data Cleaning and Preprocessing: Ensure rigorous data cleaning procedures to identify and handle missing values, outliers, and potential data entry errors. This helps improve the quality and reliability of the analysis results.
Validity of Findings: Consider the external validity of the findings by ensuring the representativeness of the study sample and generalizability of results to the target population of women experiencing difficulty in conception.
Further Analysis: Explore additional statistical techniques such as logistic regression or survival analysis, to further investigate specific research questions or outcomes related to infertility.
Reflections and Next Steps:
If given the opportunity to start over with the project or continue working on it, several actions can be taken:
Refinement of Research Questions: Clarify and refine the research questions to focus on specific aspects of infertility and reproductive health.
Longitudinal Analysis: Consider collecting longitudinal data to examine changes in fertility-related variables over time and assess their impact on fertility outcomes. Longitudinal analysis provides insights into temporal relationships and allows for the identification of predictive factors.
Collaboration with Domain Experts: Collaborate with fertility specialists and domain experts to gain insights into the clinical relevance of the findings and ensure the applicability of the analysis results in clinical practice.
Mediation analysis: Mediation analysis helps explain how certain predictor variables influence fertility outcomes by identifying intermediate factors that explain this relationship, providing deeper insights into infertility mechanisms.
Subgroup analysis: Subgroup analysis explores how predictor variables relate to fertility outcomes across different groups (e.g., age, infertility diagnoses), revealing insights into variations within the study population.
Summary
In summary, the analysis provides valuable insights into the relationships between various factors and fertility outcomes among women experiencing difficulty in conception. By conducting exploratory data analysis, regression analysis, and considering potential confounding factors, the study enhances our understanding of infertility-related factors. While there are areas for improvement and considerations for future research, the findings contribute to the broader field of reproductive health and infertility research, potentially informing clinical practice and interventions to support women experiencing fertility challenges.