library(qrcode)
qr<-qr_code("https://rpubs.com/vluu1/1426344")
plot(qr)Holzgrefe_Luu_DS7130Final_Draft
Biological factors in menstrual cycle could predict the beginning of the fertile period (measured by First High Day of Fertility) of women who are using Creighton Model natural family planning method
EV
Emily Holzgrefe & Vy Luu
Logo
QR code
Introduction
The Creighton Model is one of the most well-known natural family planning methods which is used to identify fertile and infertile periods in women. It is a standardized, natural planning method based on daily tracking of cycle data. Due to the variability between women and their cycle length, it is necessary to come up with an accurate prediction in order to manage the menstrual cycle more effectively. Thus, accurate cycle prediction is also the foundation of modern period tracking apps like Clue and Flo.
Our study aims to find out how biological factors such as Length of Cycle, Length of Luteal Phase, Length of Menses , Age, BMI, and Number of Pregnancies can be used to predict the beginning of the fertile period (measured by First High Day of Fertility) in women participating in this method. This model can be applied to help women with desired pregnancy outcomes (positive or negative).
Data Cleaning & Filtering
The original data set includes menstrual and fertility data for 156 different individuals across multiple months of study. We chose several variables from the data set that we thought were most related to our variable of interest, First Day of High Fertility, based on prior knowledge and literature. In order to make the data cross-sectional for our analysis, we averaged several variables of interest for each unique Client ID and used those values as our data points. After filtering, we had 123 unique Client IDs remaining.
library(readr)
library(dplyr)
# import data
fertility <- read_csv("FedCycleData071012.csv")
# select variables of interest
fertility0 <- fertility[,c("ClientID","FirstDayofHigh","LengthofCycle",
"LengthofLutealPhase","LengthofMenses",
"Age","BMI","Numberpreg")]
fertility0# A tibble: 1,665 × 8
ClientID FirstDayofHigh LengthofCycle LengthofLutealPhase LengthofMenses
<chr> <dbl> <dbl> <dbl> <dbl>
1 nfp8122 12 29 12 5
2 nfp8122 13 27 12 5
3 nfp8122 NA 29 14 5
4 nfp8122 13 27 12 5
5 nfp8122 12 28 12 5
6 nfp8122 10 26 11 5
7 nfp8122 NA 29 13 5
8 nfp8122 9 24 10 4
9 nfp8122 9 28 12 6
10 nfp8122 13 28 11 5
# ℹ 1,655 more rows
# ℹ 3 more variables: Age <dbl>, BMI <dbl>, Numberpreg <dbl>
# make cross-sectional
fertility1 <- fertility |>
group_by(ClientID) |>
summarize(
avg_FirstDayofHigh = mean(FirstDayofHigh,na.rm=TRUE),
avg_LengthofCycle = mean(LengthofCycle,na.rm=TRUE),
avg_LengthofLutealPhase = mean(LengthofLutealPhase,na.rm=TRUE),
avg_LengthofMenses = mean(LengthofMenses,na.rm=TRUE),
Age,
BMI,
Numberpreg) |>
filter(!is.na(Age) &
!is.na(BMI) &
!is.na(avg_FirstDayofHigh)&
!is.na(avg_LengthofLutealPhase))|>
distinct()
n_distinct(fertility1$ClientID)[1] 123
Exploratory Analysis
We conducted univariate and bivariate exploratory analyses on our variables of interest to investigate the appropriateness of a multiple linear regression model, locate possible outliers, and find any underlying patterns. The univariate analysis of our dependent variable, first day of high fertility, shows one possible outlier. The bivariate analysis does not show any concerning patterns for now.
library(ggplot2)
library(patchwork)
# univariate analysis
summary(fertility1$avg_FirstDayofHigh) Min. 1st Qu. Median Mean 3rd Qu. Max.
6.800 9.955 11.500 12.003 13.614 22.000
summary(fertility1$avg_LengthofCycle) Min. 1st Qu. Median Mean 3rd Qu. Max.
25.15 27.23 29.50 29.58 31.15 39.00
summary(fertility1$avg_LengthofLutealPhase) Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 12.03 13.33 13.17 14.24 20.00
summary(fertility1$avg_LengthofMenses) Min. 1st Qu. Median Mean 3rd Qu. Max.
2.400 4.582 5.200 5.309 5.962 9.000
summary(fertility1$Age) Min. 1st Qu. Median Mean 3rd Qu. Max.
21.00 26.00 31.00 30.73 35.50 42.00
summary(fertility1$BMI) Min. 1st Qu. Median Mean 3rd Qu. Max.
16.83 20.99 23.57 25.06 27.43 49.92
summary(fertility1$Numberpreg) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 2.00 2.52 4.00 10.00
ggplot(fertility1) +
geom_boxplot(aes(y=avg_FirstDayofHigh),
fill = "lightpink") +
labs(y = "Average First Day of High Fertility",
title = "Distribution of First Day of High Fertility")+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5, face = "bold"))# bivariate analysis
p1 <- ggplot(fertility1, aes(x=avg_LengthofCycle,y=avg_FirstDayofHigh)) +
geom_point() +
geom_smooth() +
labs(x = "Average Length of Cycle", y = "Average First Day of High Fertility")
p2 <- ggplot(fertility1, aes(x=avg_LengthofLutealPhase,y=avg_FirstDayofHigh)) +
geom_point() +
geom_smooth() +
labs(x = "Average Length of Luteal Phase", y = "Average First Day of High Fertility")
p3 <- ggplot(fertility1, aes(x=avg_LengthofMenses,y=avg_FirstDayofHigh)) +
geom_point() +
geom_smooth() +
labs(x = "Average Length of Menses", y = "Average First Day of High Fertility")
p4 <- ggplot(fertility1, aes(x=Age,y=avg_FirstDayofHigh)) +
geom_point() +
geom_smooth() +
labs(x = "Age", y = "Average First Day of High Fertility")
p5 <- ggplot(fertility1, aes(x=BMI,y=avg_FirstDayofHigh)) +
geom_point() +
geom_smooth() +
labs(x = "BMI", y = "Average First Day of High Fertility")
p6 <- ggplot(fertility1, aes(x=Numberpreg,y=avg_FirstDayofHigh)) +
geom_point() +
geom_smooth() +
labs(x = "Number of Pregnancies", y = "Average First Day of High Fertility")
(p1 + p2 + p3) / (p4 + p5 + p6)Building the Model
We decided on a multiple linear regression model with the first day of high fertility as the dependent variable, and length of cycle, length of luteal phase, length of menses, age, BMI, and number of pregancies as our independent variables. The model is statistically significant with F(6,116) = 19.78 and p < 0.001. The adjusted R squared value = 0.4801, which means that 48.01% of the variation in the first day of high fertility is explained by the variables in the model. Two explanatory variables - length of cycle and length of luteal phase - were statistically significant predictors.
f_model <- lm(avg_FirstDayofHigh ~ avg_LengthofCycle + avg_LengthofLutealPhase +
avg_LengthofMenses + Age + BMI + Numberpreg,
data = fertility1)
summary(f_model)
Call:
lm(formula = avg_FirstDayofHigh ~ avg_LengthofCycle + avg_LengthofLutealPhase +
avg_LengthofMenses + Age + BMI + Numberpreg, data = fertility1)
Residuals:
Min 1Q Median 3Q Max
-9.1388 -0.8475 -0.0877 1.4378 5.7112
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.38089 2.92557 -0.472 0.638
avg_LengthofCycle 0.65009 0.06705 9.696 < 2e-16 ***
avg_LengthofLutealPhase -0.51260 0.09038 -5.672 1.05e-07 ***
avg_LengthofMenses 0.07300 0.17581 0.415 0.679
Age 0.04094 0.04862 0.842 0.401
BMI -0.03188 0.03357 -0.950 0.344
Numberpreg 0.02307 0.11105 0.208 0.836
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.058 on 116 degrees of freedom
Multiple R-squared: 0.5057, Adjusted R-squared: 0.4801
F-statistic: 19.78 on 6 and 116 DF, p-value: 8.337e-16
Model Adequacy
We assessed our model for the following multiple linear regression assumptions:
Independence of Residuals - this was addressed when we filtered our original data. Including only one observation per Client ID ensures that all of our observations are independent from one another.
Linearity/Correct Model Specification - this is assessed using a combination of our exploratory analysis and a residual scatterplot. The scatterplot looks relatively random around zero, aside from one outlier.
Normality of Residuals - this is assessed using a QQ plot of residuals, histogram of residuals, and a Shapiro-Wilk test of normality. The QQ plot shows residuals closely following the straight line except at the tails. This is typical for a QQ plot and supports the assumption for normality. The histogram of residuals has an approximate bell-curve shape, which also supports the assumption for normality. The Shapiro-Wilk test was significant, which is evidence against normality. Our evidence points in different directions. Multiple linear regression models are robust to small violations of normality, though. For future studies, it might be worth looking into possible transformations to fix normality issues.
Equal Variance of Residuals (Homoscedasticity) - this is assessed using a residual scatterplot and a Breusch-Pagan test of homoscedasticity. The residual scatterplot does not show any obvious pattern (like coning), but the Bruesch-Pagan test was significant. This indicates a possible violation of constant variance. Again, future studies could look into transformations for fixing this assumption.
Residual Mean of Zero - this is assessed using a histogram of residuals. The histogram looks approximately centered around zero, which supports this assumption.
No Multicollinearity - this is assessed using VIF values. All values are less than 5, which indicates that the variables are not highly correlated with one another. This supports the assumption that there is no multicollinearity in the model.
library(ggpubr)
library(rstatix)
library(lmtest)
library(car)
# histogram of residuals
hist(f_model$residuals, main = " Distribution of Residuals Histogram",
xlab= "Residuals",
ylab= "Frequency",
col="pink",)# QQ plot of residuals
qq_data <- data.frame(resid = residuals(f_model))
ggplot(qq_data, aes(sample = resid)) +
stat_qq(color = "pink") +
stat_qq_line(color = "red") +
labs(title = "QQ Plot",
x = "Theoretical Quantiles",
y = "Standardized Residuals") +
theme_minimal()# residual scatterplot
f_resids <- tibble(
fitted = f_model$fitted.values,
residuals = f_model$residuals)
f_resids |>
ggplot(aes(x = fitted, y = residuals)) +
geom_point(color = "pink") +
geom_smooth(se = FALSE, color = "red") +
theme_minimal() +
labs(title = "Residuals vs. Fitted Values",
x = "Fitted Values",
y = "Residuals")#Shapiro-Wilk test
f_model$residuals |>
rstatix::shapiro_test()# A tibble: 1 × 3
variable statistic p.value
<chr> <dbl> <dbl>
1 f_model$residuals 0.936 0.0000198
#Bruesch-Pagan test
lmtest::bptest(f_model)
studentized Breusch-Pagan test
data: f_model
BP = 18.777, df = 6, p-value = 0.004558
#VIF values
vif(f_model) avg_LengthofCycle avg_LengthofLutealPhase avg_LengthofMenses
1.127799 1.024602 1.048986
Age BMI Numberpreg
2.142235 1.028176 2.011328
Cluster Analysis
We conducted cluster analyses on our variables of interest to explore subgroups with different reproductive profiles which might lead to different fertility timing. Target variable FirstDayofHigh and ClientID are removed in this part and the remaining variables are standardized. Even though 2 and 5 cluster groups are the best number of cluster with 7 votes. We pick 5 since 2 is a very low number of cluster that could not capture finer patterns of the data.
library(dplyr)
library(cluster)
library(NbClust)
# extract data
fertilitydata<-fertility1[, c("avg_LengthofCycle",
"avg_LengthofLutealPhase",
"avg_LengthofMenses",
"Age",
"BMI",
"Numberpreg")]
#standardize data
cluster <- scale(fertilitydata)
set.seed(123)
#The optimum number of clusters
cluster1 <- NbClust(cluster,
distance= "euclidean",
min.nc= 2, max.nc= 10,
method= "kmeans")*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 7 proposed 2 as the best number of clusters
* 3 proposed 3 as the best number of clusters
* 7 proposed 5 as the best number of clusters
* 2 proposed 7 as the best number of clusters
* 1 proposed 9 as the best number of clusters
* 3 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
#Show votes for best number of clusters
cluster1$Best.nc KL CH Hartigan CCC Scott Marriot TrCovW
Number_clusters 10.0000 2.0000 5.0000 2.000 5.0000 5 3.000
Value_Index 8.8393 39.2476 12.6526 -0.374 89.1693 441361912638 1670.576
TraceW Friedman Rubin Cindex DB Silhouette Duda
Number_clusters 5.0000 7.0000 5.0000 9.0000 10.0000 2.0000 2.0000
Value_Index 42.2635 1.8863 -0.1593 0.3027 1.4393 0.2268 1.0817
PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
Number_clusters 2.0000 2.0000 3.0000 3.0000 5.0000 1 2.0000
Value_Index -5.6634 -0.2858 0.3135 113.9738 0.4373 NA 0.7801
Dunn Hubert SDindex Dindex SDbw
Number_clusters 7.0000 0 5.0000 0 10.0000
Value_Index 0.1731 0 1.4472 0 0.4551
#k means cluster analysis
cluster2 <- pam(cluster, k = 5)
# Show medoids for the clusters
cluster2$medoids avg_LengthofCycle avg_LengthofLutealPhase avg_LengthofMenses Age
[1,] -1.0450615 -0.2431511 -0.0540231631 0.40440661
[2,] 0.1931222 -0.4428324 -0.0008672856 -1.37846123
[3,] 1.1011236 0.5700962 0.6370032446 0.04783304
[4,] 0.5363380 -0.2141066 0.0227575488 0.76098017
[5,] -0.5141631 -0.4889127 -0.8371864252 0.40440661
BMI Numberpreg
[1,] -0.51817286 0.2015962
[2,] -0.04754802 -1.0592343
[3,] -0.43280653 -0.2186806
[4,] 0.85148119 0.6218730
[5,] 0.13453384 1.4624267
#The cluster assignment for each row
cluster2$clustering [1] 1 2 1 2 1 3 3 2 2 4 4 5 4 5 4 3 2 2 4 4 1 3 2 1 5 5 2 2 2 4 2 5 5 5 3 2 5
[38] 5 4 1 5 2 1 3 4 1 1 4 2 3 4 2 2 2 2 1 1 4 2 4 4 1 3 3 4 2 5 1 5 4 1 1 5 1
[75] 3 1 2 1 1 2 3 1 2 1 2 3 3 3 5 1 2 3 2 5 3 2 5 2 1 2 3 2 2 2 3 3 2 3 2 1 2
[112] 3 3 3 3 2 1 4 3 5 3 5 1
#Plot the model
plot(cluster2)#mutate cluster to dataset and compute averages
fertilitygroup <- fertilitydata |>
mutate(cluster = cluster2$clustering)
fertilitygroup# A tibble: 123 × 7
avg_LengthofCycle avg_LengthofLutealPhase avg_LengthofMenses Age BMI
<dbl> <dbl> <dbl> <dbl> <dbl>
1 25.9 12.1 5.31 33 25.8
2 28.8 11.2 4.67 23 25.0
3 27.7 11.6 6 39 21.5
4 26.8 13.4 5.62 23 22.5
5 27.2 13.1 5.33 30 21.9
6 32.8 14.4 6 31 22.6
7 33.5 15 5.5 24 24.1
8 32.6 10.8 5.2 27 18.8
9 29 14.4 4.62 24 19.2
10 30.4 13.7 5.45 35 30.0
# ℹ 113 more rows
# ℹ 2 more variables: Numberpreg <dbl>, cluster <int>
fertilitysummary <- fertilitygroup |>
group_by(cluster) |>
summarise(across(everything(), ~ round(mean(.x, na.rm = TRUE), 2)))
fertilitysummary# A tibble: 5 × 7
cluster avg_LengthofCycle avg_LengthofLutealPhase avg_LengthofMenses Age
<int> <dbl> <dbl> <dbl> <dbl>
1 1 27.1 13.3 5.25 34.3
2 2 29.8 12.0 5.35 24.8
3 3 32.8 14.9 5.85 29.4
4 4 30.0 13.5 5.61 34.8
5 5 27.7 12.4 4.25 35.6
# ℹ 2 more variables: BMI <dbl>, Numberpreg <dbl>
Interactive Figures
library(plotly)
plot_ly(
data=fertility1,
x = ~avg_LengthofCycle,
y=~avg_FirstDayofHigh,
type="scatter",
mode="markers",
marker = list(color = "lightpink")
)%>%
layout(xaxis=list(title="Average Length of Cycle"),
yaxis=list(title="Average First Day of High Fertility")
)plot_ly(
data=fertility1,
x = ~avg_LengthofLutealPhase,
y=~avg_FirstDayofHigh,
type="scatter",
mode="markers",
marker = list(color = "lightpink")
)%>%
layout(xaxis=list(title="Average Length of Luteal Phase"),
yaxis=list(title="Average First Day of High Fertility"))Interactive Table
library(DT)
fertility2 <- fertility1 |>
mutate(across(where(is.numeric), ~ round(.x, 3)))
datatable(fertility2)Predictive Model
We created a predictive function that takes two inputs:
x = length of cycle
y = length of luteal phase
These two inputs are the significant predictors in our model. The rest of the variables in the model are set to the average, since they are not as statistically significant. This function takes the input of x and y and predicts the first day of high fertility in the cycle of a woman with that length of cycle and length of luteal phase. Since prediction should be limited to the range of the sample, the function returns a message if x or y values go outside of the ranges covered in the sample data.
predicted_highdayfert <- function(x,y) {
day <- round(((x*0.65009) - (y*0.51260) - 0.6343843), digits = 0)
if (x < 24 | x > 40 | y < 6 | y > 20) {
print("One or both of your values are outside the predictive range of this model.")
} else {
print(paste("Your predicted first day of high fertility is Day",day,"of your cycle."))
}
}
# example 1:
predicted_highdayfert(28,13)[1] "Your predicted first day of high fertility is Day 11 of your cycle."
# example 2:
predicted_highdayfert(42,20)[1] "One or both of your values are outside the predictive range of this model."
Conclusion
This study identify two significant biologicalstatistically variables, namely length of Cycle, length of luteal phase. The remaining variables barely contributed to our dependent variable. Based the nmultiple linear regression model, we are able to predict the first day of high fertility of women who participated in the Creighton Model natural family planning method with adjusted R squared of 0.4801 = 48,01%. We divided 5 biological sub-group to explore different reproductive profiles. Notably, cluster group 2 (women at age ~25, BMI ~23.5,Numberpreg ~0.17) youngest demographic with lowest number of pregnancy and lowest length of luteal phase cluster 3 (women at age ~29, BMI ~22.9,Numberpreg ~2) has highest length of Cycle, length of Luteal Phase,length of Menses. cluster group 1 (women at age ~34, BMI ~23.1,Numberpreg ~3) has lowest length of cycle cluster group 4 (women at age ~35, BMI ~33.8,Numberpreg ~4) has highest BMI. cluster group 5 (women at age ~36, BMI ~25.9,Numberpreg ~6) oldest demographic with lowest length of menses. These patterns may reflect underlying biological differences in reproductive aging across groups. ###Limitation dataset did not record demographic data ###Future research
KSU Interactive Map & Contacts
Map
library(leaflet)
leaflet() %>%
addTiles()%>%
addMarkers(
lng=-84.5184,
lat=33.9384,
popup="Kennesaw State University - Marietta Campus"
)Contacts
Emily Holzgrefe - eholzgre@kennesaw.edu
Vy Luu - vluu1@students.kennesaw.edu