Holzgrefe_Luu_DS7130Final_Draft

Author

Emily Holzgrefe and Vy Luu

Biological factors in menstrual cycle could predict the beginning of the fertile period (measured by First High Day of Fertility) of women who are using Creighton Model natural family planning method

EV

Emily Holzgrefe & Vy Luu

Logo

QR code

library(qrcode)
qr<-qr_code("https://rpubs.com/vluu1/1426344")
plot(qr)

Introduction

The Creighton Model is one of the most well-known natural family planning methods which is used to identify fertile and infertile periods in women. It is a standardized, natural planning method based on daily tracking of cycle data. Due to the variability between women and their cycle length, it is necessary to come up with an accurate prediction in order to manage the menstrual cycle more effectively. Thus, accurate cycle prediction is also the foundation of modern period tracking apps like Clue and Flo.

Our study aims to find out how biological factors such as Length of Cycle, Length of Luteal Phase, Length of Menses , Age, BMI, and Number of Pregnancies can be used to predict the beginning of the fertile period (measured by First High Day of Fertility) in women participating in this method. This model can be applied to help women with desired pregnancy outcomes (positive or negative).

Data Cleaning & Filtering

The original data set includes menstrual and fertility data for 156 different individuals across multiple months of study. We chose several variables from the data set that we thought were most related to our variable of interest, First Day of High Fertility, based on prior knowledge and literature. In order to make the data cross-sectional for our analysis, we averaged several variables of interest for each unique Client ID and used those values as our data points. After filtering, we had 123 unique Client IDs remaining.

library(readr)
library(dplyr)

# import data
fertility <- read_csv("FedCycleData071012.csv")

# select variables of interest
fertility0 <- fertility[,c("ClientID","FirstDayofHigh","LengthofCycle",
                           "LengthofLutealPhase","LengthofMenses",
                           "Age","BMI","Numberpreg")]
fertility0

# A tibble: 1,665 × 8
   ClientID FirstDayofHigh LengthofCycle LengthofLutealPhase LengthofMenses
   <chr>             <dbl>         <dbl>               <dbl>          <dbl>
 1 nfp8122              12            29                  12              5
 2 nfp8122              13            27                  12              5
 3 nfp8122              NA            29                  14              5
 4 nfp8122              13            27                  12              5
 5 nfp8122              12            28                  12              5
 6 nfp8122              10            26                  11              5
 7 nfp8122              NA            29                  13              5
 8 nfp8122               9            24                  10              4
 9 nfp8122               9            28                  12              6
10 nfp8122              13            28                  11              5
# ℹ 1,655 more rows
# ℹ 3 more variables: Age <dbl>, BMI <dbl>, Numberpreg <dbl>

# make cross-sectional
fertility1 <- fertility |>
  group_by(ClientID) |>
  summarize(
    avg_FirstDayofHigh = mean(FirstDayofHigh,na.rm=TRUE),
    avg_LengthofCycle = mean(LengthofCycle,na.rm=TRUE),
    avg_LengthofLutealPhase = mean(LengthofLutealPhase,na.rm=TRUE),
    avg_LengthofMenses = mean(LengthofMenses,na.rm=TRUE),
    Age,
    BMI,
    Numberpreg) |>
  filter(!is.na(Age) &
           !is.na(BMI) &
           !is.na(avg_FirstDayofHigh)&
           !is.na(avg_LengthofLutealPhase))|>
  distinct()

n_distinct(fertility1$ClientID)

[1] 123

Exploratory Analysis

We conducted univariate and bivariate exploratory analyses on our variables of interest to investigate the appropriateness of a multiple linear regression model, locate possible outliers, and find any underlying patterns. The univariate analysis of our dependent variable, first day of high fertility, shows one possible outlier. The bivariate analysis does not show any concerning patterns for now.

library(ggplot2)
library(patchwork)

# univariate analysis
summary(fertility1$avg_FirstDayofHigh)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  6.800   9.955  11.500  12.003  13.614  22.000

summary(fertility1$avg_LengthofCycle)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.15   27.23   29.50   29.58   31.15   39.00

summary(fertility1$avg_LengthofLutealPhase)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.00   12.03   13.33   13.17   14.24   20.00

summary(fertility1$avg_LengthofMenses)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.400   4.582   5.200   5.309   5.962   9.000

summary(fertility1$Age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.00   26.00   31.00   30.73   35.50   42.00

summary(fertility1$BMI)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.83   20.99   23.57   25.06   27.43   49.92

summary(fertility1$Numberpreg)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    2.00    2.52    4.00   10.00

ggplot(fertility1) +
  geom_boxplot(aes(y=avg_FirstDayofHigh), 
               fill = "lightpink")  +
  labs(y = "Average First Day of High Fertility", 
       title = "Distribution of First Day of High Fertility")+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# bivariate analysis
p1 <- ggplot(fertility1, aes(x=avg_LengthofCycle,y=avg_FirstDayofHigh)) +
  geom_point() +
  geom_smooth() + 
  labs(x = "Average Length of Cycle", y = "Average First Day of High Fertility")
p2 <- ggplot(fertility1, aes(x=avg_LengthofLutealPhase,y=avg_FirstDayofHigh)) +
  geom_point() +
  geom_smooth() + 
  labs(x = "Average Length of Luteal Phase", y = "Average First Day of High Fertility")
p3 <- ggplot(fertility1, aes(x=avg_LengthofMenses,y=avg_FirstDayofHigh)) +
  geom_point() +
  geom_smooth() + 
  labs(x = "Average Length of Menses", y = "Average First Day of High Fertility")
p4 <- ggplot(fertility1, aes(x=Age,y=avg_FirstDayofHigh)) +
  geom_point() +
  geom_smooth() + 
  labs(x = "Age", y = "Average First Day of High Fertility")
p5 <- ggplot(fertility1, aes(x=BMI,y=avg_FirstDayofHigh)) +
  geom_point() +
  geom_smooth() + 
  labs(x = "BMI", y = "Average First Day of High Fertility")
p6 <- ggplot(fertility1, aes(x=Numberpreg,y=avg_FirstDayofHigh)) +
  geom_point() +
  geom_smooth() + 
  labs(x = "Number of Pregnancies", y = "Average First Day of High Fertility")

(p1 + p2 + p3) / (p4 + p5 + p6)

Building the Model

We decided on a multiple linear regression model with the first day of high fertility as the dependent variable, and length of cycle, length of luteal phase, length of menses, age, BMI, and number of pregancies as our independent variables. The model is statistically significant with F(6,116) = 19.78 and p < 0.001. The adjusted R squared value = 0.4801, which means that 48.01% of the variation in the first day of high fertility is explained by the variables in the model. Two explanatory variables - length of cycle and length of luteal phase - were statistically significant predictors.

f_model <- lm(avg_FirstDayofHigh ~ avg_LengthofCycle + avg_LengthofLutealPhase + 
                avg_LengthofMenses + Age + BMI + Numberpreg, 
                data = fertility1)
summary(f_model)


Call:
lm(formula = avg_FirstDayofHigh ~ avg_LengthofCycle + avg_LengthofLutealPhase + 
    avg_LengthofMenses + Age + BMI + Numberpreg, data = fertility1)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1388 -0.8475 -0.0877  1.4378  5.7112 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -1.38089    2.92557  -0.472    0.638    
avg_LengthofCycle        0.65009    0.06705   9.696  < 2e-16 ***
avg_LengthofLutealPhase -0.51260    0.09038  -5.672 1.05e-07 ***
avg_LengthofMenses       0.07300    0.17581   0.415    0.679    
Age                      0.04094    0.04862   0.842    0.401    
BMI                     -0.03188    0.03357  -0.950    0.344    
Numberpreg               0.02307    0.11105   0.208    0.836    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.058 on 116 degrees of freedom
Multiple R-squared:  0.5057,    Adjusted R-squared:  0.4801 
F-statistic: 19.78 on 6 and 116 DF,  p-value: 8.337e-16

Model Adequacy

We assessed our model for the following multiple linear regression assumptions:

Independence of Residuals - this was addressed when we filtered our original data. Including only one observation per Client ID ensures that all of our observations are independent from one another.
Linearity/Correct Model Specification - this is assessed using a combination of our exploratory analysis and a residual scatterplot. The scatterplot looks relatively random around zero, aside from one outlier.
Normality of Residuals - this is assessed using a QQ plot of residuals, histogram of residuals, and a Shapiro-Wilk test of normality. The QQ plot shows residuals closely following the straight line except at the tails. This is typical for a QQ plot and supports the assumption for normality. The histogram of residuals has an approximate bell-curve shape, which also supports the assumption for normality. The Shapiro-Wilk test was significant, which is evidence against normality. Our evidence points in different directions. Multiple linear regression models are robust to small violations of normality, though. For future studies, it might be worth looking into possible transformations to fix normality issues.
Equal Variance of Residuals (Homoscedasticity) - this is assessed using a residual scatterplot and a Breusch-Pagan test of homoscedasticity. The residual scatterplot does not show any obvious pattern (like coning), but the Bruesch-Pagan test was significant. This indicates a possible violation of constant variance. Again, future studies could look into transformations for fixing this assumption.
Residual Mean of Zero - this is assessed using a histogram of residuals. The histogram looks approximately centered around zero, which supports this assumption.
No Multicollinearity - this is assessed using VIF values. All values are less than 5, which indicates that the variables are not highly correlated with one another. This supports the assumption that there is no multicollinearity in the model.

library(ggpubr)
library(rstatix)
library(lmtest)
library(car)

# histogram of residuals
hist(f_model$residuals, main = " Distribution of Residuals Histogram",
                     xlab= "Residuals",
                     ylab= "Frequency",
                     col="pink",)

# QQ plot of residuals
qq_data <- data.frame(resid = residuals(f_model))
ggplot(qq_data, aes(sample = resid)) +
  stat_qq(color = "pink") + 
  stat_qq_line(color = "red") + 
  labs(title = "QQ Plot",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()

# residual scatterplot
f_resids <- tibble(
  fitted = f_model$fitted.values,
  residuals = f_model$residuals)
f_resids |>
  ggplot(aes(x = fitted, y = residuals)) +
  geom_point(color = "pink") +
  geom_smooth(se = FALSE, color = "red") +
  theme_minimal() +
  labs(title = "Residuals vs. Fitted Values",
       x = "Fitted Values",
       y = "Residuals")

#Shapiro-Wilk test
f_model$residuals |>
  rstatix::shapiro_test()

# A tibble: 1 × 3
  variable          statistic   p.value
  <chr>                 <dbl>     <dbl>
1 f_model$residuals     0.936 0.0000198

#Bruesch-Pagan test
lmtest::bptest(f_model)


    studentized Breusch-Pagan test

data:  f_model
BP = 18.777, df = 6, p-value = 0.004558

#VIF values
vif(f_model)

      avg_LengthofCycle avg_LengthofLutealPhase      avg_LengthofMenses 
               1.127799                1.024602                1.048986 
                    Age                     BMI              Numberpreg 
               2.142235                1.028176                2.011328

Cluster Analysis

We conducted cluster analyses on our variables of interest to explore subgroups with different reproductive profiles which might lead to different fertility timing. Target variable FirstDayofHigh and ClientID are removed in this part and the remaining variables are standardized. Even though 2 and 5 cluster groups are the best number of cluster with 7 votes. We pick 5 since 2 is a very low number of cluster that could not capture finer patterns of the data.

library(dplyr)
library(cluster)
library(NbClust)
# extract data 
fertilitydata<-fertility1[, c("avg_LengthofCycle",
                              "avg_LengthofLutealPhase",
                              "avg_LengthofMenses",
                              "Age",
                              "BMI",
                              "Numberpreg")]
#standardize data
cluster <- scale(fertilitydata)
set.seed(123)

#The optimum number of clusters
cluster1 <- NbClust(cluster,
                    distance= "euclidean",
                    min.nc= 2, max.nc= 10,
                    method= "kmeans")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 7 proposed 2 as the best number of clusters 
* 3 proposed 3 as the best number of clusters 
* 7 proposed 5 as the best number of clusters 
* 2 proposed 7 as the best number of clusters 
* 1 proposed 9 as the best number of clusters 
* 3 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
*******************************************************************

#Show votes for best number of clusters
cluster1$Best.nc

                     KL      CH Hartigan    CCC   Scott      Marriot   TrCovW
Number_clusters 10.0000  2.0000   5.0000  2.000  5.0000            5    3.000
Value_Index      8.8393 39.2476  12.6526 -0.374 89.1693 441361912638 1670.576
                 TraceW Friedman   Rubin Cindex      DB Silhouette   Duda
Number_clusters  5.0000   7.0000  5.0000 9.0000 10.0000     2.0000 2.0000
Value_Index     42.2635   1.8863 -0.1593 0.3027  1.4393     0.2268 1.0817
                PseudoT2   Beale Ratkowsky     Ball PtBiserial Frey McClain
Number_clusters   2.0000  2.0000    3.0000   3.0000     5.0000    1  2.0000
Value_Index      -5.6634 -0.2858    0.3135 113.9738     0.4373   NA  0.7801
                  Dunn Hubert SDindex Dindex    SDbw
Number_clusters 7.0000      0  5.0000      0 10.0000
Value_Index     0.1731      0  1.4472      0  0.4551

#k means cluster analysis
cluster2 <- pam(cluster, k = 5)
# Show medoids for the clusters
cluster2$medoids

     avg_LengthofCycle avg_LengthofLutealPhase avg_LengthofMenses         Age
[1,]        -1.0450615              -0.2431511      -0.0540231631  0.40440661
[2,]         0.1931222              -0.4428324      -0.0008672856 -1.37846123
[3,]         1.1011236               0.5700962       0.6370032446  0.04783304
[4,]         0.5363380              -0.2141066       0.0227575488  0.76098017
[5,]        -0.5141631              -0.4889127      -0.8371864252  0.40440661
             BMI Numberpreg
[1,] -0.51817286  0.2015962
[2,] -0.04754802 -1.0592343
[3,] -0.43280653 -0.2186806
[4,]  0.85148119  0.6218730
[5,]  0.13453384  1.4624267

#The cluster assignment for each row
cluster2$clustering

  [1] 1 2 1 2 1 3 3 2 2 4 4 5 4 5 4 3 2 2 4 4 1 3 2 1 5 5 2 2 2 4 2 5 5 5 3 2 5
 [38] 5 4 1 5 2 1 3 4 1 1 4 2 3 4 2 2 2 2 1 1 4 2 4 4 1 3 3 4 2 5 1 5 4 1 1 5 1
 [75] 3 1 2 1 1 2 3 1 2 1 2 3 3 3 5 1 2 3 2 5 3 2 5 2 1 2 3 2 2 2 3 3 2 3 2 1 2
[112] 3 3 3 3 2 1 4 3 5 3 5 1

#Plot the model
plot(cluster2)

#mutate cluster to dataset and compute averages
fertilitygroup <- fertilitydata |>
  mutate(cluster = cluster2$clustering) 
fertilitygroup

# A tibble: 123 × 7
   avg_LengthofCycle avg_LengthofLutealPhase avg_LengthofMenses   Age   BMI
               <dbl>                   <dbl>              <dbl> <dbl> <dbl>
 1              25.9                    12.1               5.31    33  25.8
 2              28.8                    11.2               4.67    23  25.0
 3              27.7                    11.6               6       39  21.5
 4              26.8                    13.4               5.62    23  22.5
 5              27.2                    13.1               5.33    30  21.9
 6              32.8                    14.4               6       31  22.6
 7              33.5                    15                 5.5     24  24.1
 8              32.6                    10.8               5.2     27  18.8
 9              29                      14.4               4.62    24  19.2
10              30.4                    13.7               5.45    35  30.0
# ℹ 113 more rows
# ℹ 2 more variables: Numberpreg <dbl>, cluster <int>

fertilitysummary <- fertilitygroup |>
  group_by(cluster) |>
  summarise(across(everything(), ~ round(mean(.x, na.rm = TRUE), 2)))
fertilitysummary

# A tibble: 5 × 7
  cluster avg_LengthofCycle avg_LengthofLutealPhase avg_LengthofMenses   Age
    <int>             <dbl>                   <dbl>              <dbl> <dbl>
1       1              27.1                    13.3               5.25  34.3
2       2              29.8                    12.0               5.35  24.8
3       3              32.8                    14.9               5.85  29.4
4       4              30.0                    13.5               5.61  34.8
5       5              27.7                    12.4               4.25  35.6
# ℹ 2 more variables: BMI <dbl>, Numberpreg <dbl>

Interactive Figures

library(plotly)

plot_ly(
  data=fertility1,
  x = ~avg_LengthofCycle,
  y=~avg_FirstDayofHigh,
  type="scatter",
  mode="markers",
  marker = list(color = "lightpink")
)%>%
  layout(xaxis=list(title="Average Length of Cycle"),
         yaxis=list(title="Average First Day of High Fertility")
         )

plot_ly(
  data=fertility1,
  x = ~avg_LengthofLutealPhase,
  y=~avg_FirstDayofHigh,
  type="scatter",
  mode="markers",
  marker = list(color = "lightpink")
)%>%
  layout(xaxis=list(title="Average Length of Luteal Phase"),
         yaxis=list(title="Average First Day of High Fertility"))

Interactive Table

library(DT)

fertility2 <- fertility1 |>
  mutate(across(where(is.numeric), ~ round(.x, 3)))

datatable(fertility2)

Predictive Model

We created a predictive function that takes two inputs:

x = length of cycle
y = length of luteal phase

These two inputs are the significant predictors in our model. The rest of the variables in the model are set to the average, since they are not as statistically significant. This function takes the input of x and y and predicts the first day of high fertility in the cycle of a woman with that length of cycle and length of luteal phase. Since prediction should be limited to the range of the sample, the function returns a message if x or y values go outside of the ranges covered in the sample data.

predicted_highdayfert <- function(x,y) {
  day <- round(((x*0.65009) - (y*0.51260) - 0.6343843), digits = 0)
  if (x < 24 | x > 40 | y < 6 | y > 20) {
    print("One or both of your values are outside the predictive range of this model.")
  } else {
    print(paste("Your predicted first day of high fertility is Day",day,"of your cycle."))
  }
}

# example 1:
predicted_highdayfert(28,13)

[1] "Your predicted first day of high fertility is Day 11 of your cycle."

# example 2:
predicted_highdayfert(42,20)

[1] "One or both of your values are outside the predictive range of this model."

Conclusion

This study identify two significant biologicalstatistically variables, namely length of Cycle, length of luteal phase. The remaining variables barely contributed to our dependent variable. Based the nmultiple linear regression model, we are able to predict the first day of high fertility of women who participated in the Creighton Model natural family planning method with adjusted R squared of 0.4801 = 48,01%. We divided 5 biological sub-group to explore different reproductive profiles. Notably, cluster group 2 (women at age ~25, BMI ~23.5,Numberpreg ~0.17) youngest demographic with lowest number of pregnancy and lowest length of luteal phase cluster 3 (women at age ~29, BMI ~22.9,Numberpreg ~2) has highest length of Cycle, length of Luteal Phase,length of Menses. cluster group 1 (women at age ~34, BMI ~23.1,Numberpreg ~3) has lowest length of cycle cluster group 4 (women at age ~35, BMI ~33.8,Numberpreg ~4) has highest BMI. cluster group 5 (women at age ~36, BMI ~25.9,Numberpreg ~6) oldest demographic with lowest length of menses. These patterns may reflect underlying biological differences in reproductive aging across groups. ###Limitation dataset did not record demographic data ###Future research

KSU Interactive Map & Contacts

Map

library(leaflet)

leaflet() %>%
  addTiles()%>%
  addMarkers(
    lng=-84.5184,
    lat=33.9384,
    popup="Kennesaw State University - Marietta Campus"
  )

Contacts

Emily Holzgrefe - eholzgre@kennesaw.edu

Vy Luu - vluu1@students.kennesaw.edu