knitr::opts_chunk$set(echo = TRUE)
options(warn=-1)
# set wd
setwd('/Users/dpong/Data 606/Project')
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
require(data.table)
## Loading required package: data.table
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
# install.packages("GGally")
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(DATA606)
## Loading required package: shiny
## Loading required package: openintro
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
## Loading required package: OIdata
## Loading required package: RCurl
## Loading required package: maps
## Loading required package: markdown
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
For the sake of loading the data more quickly, I am going to be commenting out most of the datasets that are not going to be relevant in answering questions.
# load data
# commented out a lot of columns that are related to preterm births that I originally thought would
# include in this observational studies as part 2.
df <- read_fwf('Nat2018PublicUS.c20190509.r20190717.txt',
fwf_cols(
# DOB_YY = c(9, 12),
# DOB_MM = c(13, 14),
# DOB_TT = c(19, 22),
# BFACIL = c(32, 32),
# MAGE_IMPFLG = c(73, 73),
# MAGE_REPFLG = c(74, 74),
MAGER = c(75, 76), # Mother’s Single Years of Age
MBSTATE_REC = c(84, 84), # Mother’s Nativity
MRACE31 = c(105, 106), # Mother’s Race Recode 31
MRACE15 = c(108, 109), # Mother’s Race Recode 15
# MAR_P = c(119, 119),
# DMAR = c(120, 120),
MEDUC = c(124, 124), # Mother’s Education
# FAGECOMB = c(147, 148),
# FAGERPT_FLG = c(142, 142),
# FRACE31 = c(151, 152),
# FRACE15 = c(154, 155),
# FEDUC = c(163, 163),
# COMPGST_IMP = c(488, 488),
# OBGEST_FLG = c(489, 489),
# COMBGEST = c(490, 491),
# GESTREC10 = c(492, 493),
# GESTREC3 = c(494, 494),
# OEGest_Comb = c(499, 500),
# DBWT = c(504, 507),
# SEX = c(475, 475),
DPLURAL = c(454, 454), # Plurality Recode
# PAY_REC = c(436, 436),
# RDMETH_REC = c(407, 407),
# DMETH_REC = c(408, 408),
# ME_ROUT = c(402, 402),
# WTGAIN = c(304, 305),
# WTGAIN_REC = c(306, 306),
# DWgt_R = c(299, 301),
# PWgt_R = c(292, 294),
# BMI_R = c(287, 287),
# M_Ht_In = c(280, 281),
# CIG_3 = c(259, 260), # Cigarettes 3rd Trimester (not including)
# CIG_2 = c(257, 258), # Cigarettes 2nd Trimester (not including)
# CIG_1 = c(255, 256), # Cigarettes 1st Trimester (not including)
# WIC = c(251, 251), # The Special Supplemental Nutrition Program for Women, Infants,
# and Children (not including)
# PREVIS = c(238, 239),
# PRECARE5 = c(277, 277),
# PRECARE = c(224, 225),
# RF_PPTERM = c(318, 318),
# RF_GDIAB = c(314, 314),
# RF_PDIAB = c(313, 313),
RF_INFTR = c(325, 325), # Infertility Treatment Used
RF_FEDRG = c(326, 326), # Fertility Enhancing Drugs
RF_ARTEC = c(327, 327) # Asst. Reproductive Technology
) ,
skip = 1, skip_empty_rows = TRUE
)
head(df, 1)
df %>% filter (DPLURAL != 1)
There are essentially one overall question that I wanted to answer with this project. Would the age of mother affect the chances of having a multiple pregnancy (vs. singleton pregnancy)?
Secondarily, I’d like to see if the intake of fertility enhancing drug and age of mother has any relationship with the outcome of having a multiple pregnancy (vs. singleton pregnancy)?
summary(df)
## MAGER MBSTATE_REC MRACE31 MRACE15
## Min. :12.00 Min. :1.000 Length:3801533 Length:3801533
## 1st Qu.:25.00 1st Qu.:1.000 Class :character Class :character
## Median :29.00 Median :1.000 Mode :character Mode :character
## Mean :29.01 Mean :1.233
## 3rd Qu.:33.00 3rd Qu.:1.000
## Max. :50.00 Max. :3.000
## MEDUC DPLURAL RF_INFTR RF_FEDRG
## Min. :1.000 Min. :1.000 Length:3801533 Length:3801533
## 1st Qu.:3.000 1st Qu.:1.000 Class :character Class :character
## Median :4.000 Median :1.000 Mode :character Mode :character
## Mean :4.413 Mean :1.034
## 3rd Qu.:6.000 3rd Qu.:1.000
## Max. :9.000 Max. :5.000
## RF_ARTEC
## Length:3801533
## Class :character
## Mode :character
##
##
##
nrow(df)
## [1] 3801533
What are the cases, and how many are there?
Each observation represents a birth record (or attempt), or more precisely, vital event data, in 2018 in US. There are 3801533 records in this dataset. Of which, multiple pregnancy ties to 127419 records.
Describe the method of data collection. As data is provided by CDC’s national center for Health Statistics. I believe the data collection happened at the institutional level.
df %>% count(DPLURAL)
What’s multiple Pregnancy?
A multiple pregnancy is a pregnancy with 2 or more fetuses. Some names for these are:
Defined the variable twins as a recoding of the variable DPLURAL, which classified what kind of pregnancy is it really. Single, Twin, Triplet, Quadruplet, and Quintuplet or higher are coded for 1, 2, 3, 4, 5 respectively. Having at least a twin (2+ in multiple pregnancy) can then be easier defined by the logic of having DPLURAL greater than or equal to 2.
df %>% mutate (twins = (DPLURAL >= 2)) %>% group_by (MAGER, twins) %>% summarise(n=n()) %>%
group_by(MAGER) %>% mutate(Percentage=round(n/sum(n)*100,2))
On the other hand, fertility enhancing drugs is coded in the variable RF_FEDRG
df %>% group_by(RF_FEDRG) %>% tally()
N: No Y: Yes X: Not applicable U: Unknown or not stated
So, for the purpose of this study, in the EDA, I’m just going to include Y and N.
df1 <- df %>%
mutate (twins = (DPLURAL >= 2)) %>%
count(MAGER, twins) %>%
group_by(MAGER) %>%
transmute(twins, Percentage=round(n/sum(n)*100,2))
df1
# min of mother's age
min(df1$MAGER)
## [1] 12
# max of mother's age
max(df1$MAGER)
## [1] 50
par(mar = c(3.9, 3.9, 0.5, 0.5), las = 0, mgp = c(2.7, 0.7, 0),
cex.lab = 1.5, cex.axis = 1.5, las = 1)
ggplot(df1, aes(fill=twins, y=Percentage, x=MAGER)) +
geom_bar(position="stack", stat="identity") +
scale_x_continuous(breaks = pretty(df1$MAGER, n = 39), name = "Mother’s Single Years of Age",
labels = c("10-12", seq(13, 49, 1), "50+")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # changed the angle of the xlabels to
# 45 degree
labs(fill = "Multiple Pregnancies") +
scale_fill_discrete(labels = c("No", "Yes"))
As you can see, the percentage breakdwon of having twins (or more) has a percentage that is increasing as age of mother increases.
Secondarily, let’s look at the effect of fertility-enhancing drug.
df2 <- df %>% subset(RF_FEDRG %in% c('Y', 'N')) %>%
mutate (fert_treat = RF_FEDRG) %>%
mutate (twins = (DPLURAL >= 2)) %>%
count(MAGER, twins, fert_treat) %>%
group_by(MAGER, twins, fert_treat) %>%
transmute(counts = sum(n))
df2
# min for MAGER
min(df2$MAGER)
## [1] 17
# max for MAGER
max(df2$MAGER)
## [1] 50
# note that the min age is 17 and the maximum age is 50+. So we have only 34 bins (n)
par(mar = c(3.9, 3.9, 0.5, 0.5), las = 0, mgp = c(2.7, 0.7, 0),
cex.lab = 1.5, cex.axis = 1.5, las = 1)
ggplot(df2, aes(fill=twins, y= counts, x=MAGER)) +
geom_bar(position="stack", stat="identity") +
scale_x_continuous(breaks = pretty(df2$MAGER, n = 34), name = "Mother’s Single Years of Age",
labels = c(seq(17, 49, 1), "50+")) +
ylab("relative frequencies") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # changed the angle of the xlabels to
# 45 degree
facet_grid(~c("Did Not Use Fertility-enhancing Drug", "Used Fertility-enhancing Drug"), scales =
"fixed") +
labs(fill = "Multiple Pregnancy") +
scale_fill_discrete(labels = c("No", "Yes"))
It’s not hard to imagine that mother’s distribution of mother’s by age is normally distributed.
df2a <- df %>% subset(RF_FEDRG %in% c('Y', 'N')) %>%
mutate (fert_treat = RF_FEDRG) %>%
mutate (twins = (DPLURAL >= 2)) %>%
count(MAGER, twins, fert_treat) %>%
group_by(MAGER) %>%
transmute(twins, fert_treat, percentage = round(n/sum(n)*100,2))
df2a
par(mar = c(3.9, 3.9, 0.5, 0.5), las = 0, mgp = c(2.7, 0.7, 0),
cex.lab = 1.5, cex.axis = 1.5, las = 1)
ggplot(df2a, aes(fill=twins, y= percentage, x=MAGER)) +
geom_bar(position="stack", stat="identity") +
geom_text(aes( label = scales::percent(percentage/100),
y= percentage ), stat= "identity", angle = 90, position = position_dodge(1)) +
scale_x_continuous(breaks = pretty(df2a$MAGER, n = 34), name = "Mother’s Single Years of Age",
labels = c(seq(17, 49, 1), "50+")) +
ylab("Percent") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # changed the angle of the xlabels to
# 45 degree
facet_grid(vars (c("Did Not Use Fertility-enhancing Drug", "Used Fertility-enhancing Drug")),
scales = "free_y") +
labs(fill = "Multiple Pregnancy") +
scale_fill_discrete(labels = c("No", "Yes"))
One can tell, at a given age of mother, from the bar charts above, what’s the percentages of mothers that receive treatments. Take age 39 as an example, 15% of mothers did not receive fertility treatment and ended up having, at least, a twin pregnancy (or multiple pregnancy) while 51% of mothers did not receive fertility treatment and ended up having single pregnancy. On the other hand, 7% of mothers did receive fertility treatment and ended up having multiple pregnancy while 27% did receive treatment and ended up having single pregnancy.
Let’s repeat this exercise for RF_INFTR (Infertility Treatment Used) and RF_ARTEC (Assisted Reproductive Technology).
df2b <- df %>% subset(RF_INFTR %in% c('Y', 'N')) %>%
mutate (infert_treat = RF_INFTR) %>%
mutate (twins = (DPLURAL >= 2)) %>%
count(MAGER, twins, infert_treat) %>%
group_by(MAGER) %>%
transmute(twins, infert_treat, percentage = round(n/sum(n)*100,2))
df2b
# min for MAGER
min(df2b$MAGER)
## [1] 12
# max for MAGER
max(df2b$MAGER)
## [1] 50
par(mar = c(3.9, 3.9, 0.5, 0.5), las = 0, mgp = c(2.7, 0.7, 0),
cex.lab = 1.5, cex.axis = 1.5, las = 1)
ggplot(df2b, aes(fill=twins, y= percentage, x=MAGER)) +
geom_bar(position="stack", stat="identity") +
geom_text(aes( label = scales::percent(percentage/100),
y= percentage ), stat= "identity", angle = 90, position = position_dodge(1)) +
scale_x_continuous(breaks = pretty(df2b$MAGER, n = 35), name = "Mother’s Single Years of Age",
labels = c("10-12", seq(13, 49, 1), "50+")) +
ylab("Percent") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # changed the angle of the xlabels to
# 45 degree
facet_grid(~infert_treat) +
labs(fill = "Multiple Pregnancy") +
scale_fill_discrete(labels = c("No", "Yes"))
The Y/N above signifies whether the mother received infertility treatment. For those that have received infertility treatment, it seems there is a higher percentage of mothers would wind up having multiple pregnacy.
df2c <- df %>% subset(RF_ARTEC %in% c('Y', 'N')) %>%
mutate (assist_rtech = RF_ARTEC) %>%
mutate (twins = (DPLURAL >= 2)) %>%
count(MAGER, twins, assist_rtech) %>%
group_by(MAGER) %>%
transmute(twins, assist_rtech, percentage = round(n/sum(n)*100,2))
df2c
# min for MAGER
min(df2c$MAGER)
## [1] 17
# max for MAGER
max(df2c$MAGER)
## [1] 50
par(mar = c(3.9, 3.9, 0.5, 0.5), las = 0, mgp = c(2.7, 0.7, 0),
cex.lab = 1.5, cex.axis = 1.5, las = 1)
ggplot(df2c, aes(fill=twins, y= percentage, x=MAGER)) +
geom_bar(position="stack", stat="identity") +
geom_text(aes( label = scales::percent(percentage/100),
y= percentage ), stat= "identity", angle = 90, position = position_dodge(1)) +
scale_x_continuous(breaks = pretty(df2c$MAGER, n = 34), name = "Mother’s Single Years of Age",
labels = c(seq(17, 49, 1), "50+")) +
ylab("Percent") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # changed the angle of the xlabels
# to 45 degree
facet_grid(vars (c("Did not use Assisted Reproductive Tech", "Used Assisted Reproductive Tech")),
scales = "free_y") +
labs(fill = "Multiple Pregnancy") +
scale_fill_discrete(labels = c("No", "Yes"))
The Y/N above signifies whether the mother used assisted reproductive technology. For those that have used this technology, it seems there is a higher percentage of mothers would wind up having multiple pregnacy, esp. over 45+ years old.
Conditions: 1) Independence 2) Normality N > 30. Satisfied.
Before we go into inferencing, we need to take a random sample from this observational data set.
df3 <- df[sample(nrow(df),.1*nrow(df)),]
df3
H0: Average Mother’s age for multiple pregnacy - average mother’s age for single pregnancy = 0
H1: Average Mother’s age for multiple pregnacy - average mother’s age for single pregnancy != 0
df4 <- df3 %>%
mutate (twins = (DPLURAL >= 2)) %>%
count(MAGER, twins) %>%
group_by(MAGER) %>%
transmute(twins, Percentage=round(n/sum(n)*100,2))
df4
df4$twins <- as.factor(df4$twins)
inference(y = df4$MAGER, x = df4$twins, type = "ht", null = 0,
alternative = "twosided", method = "theoretical", est = "mean")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_FALSE = 39, mean_FALSE = 31, sd_FALSE = 11.4018
## n_TRUE = 37, mean_TRUE = 32, sd_TRUE = 10.8244
## Observed difference between means (FALSE-TRUE) = -1
##
## H0: mu_FALSE - mu_TRUE = 0
## HA: mu_FALSE - mu_TRUE != 0
## Standard error = 2.55
## Test statistic: Z = -0.392
## p-value = 0.6948
The boxplots of TRUE or FALSE in multiple pregnancy show the two boxplots have approximately the same medians of mother’s age. The true group of twins, which signify there is multiple pregnacy, has a slightly higher median in mother’s age. That truthfully tells the story as we do see that trend in exploratory data analysis (EDA) stage. But there is no statistically significance in this hypothesis testing. So we conclude that statistically there is no difference between age of mothers for the group of multiple pregnancy vs. single pregnancy.
I really thought of designing the hypothesis testing based on a crude multiple pregnancy rate, which is to mean per 1000 people out of the random sample of 3K records, what are the rates (really proportions are meant here) of mothers with multiple pregnancy and mothers with single pregnancy (False on multiple pregnancy). I felt this is a standard way of looking at stuff. But for the purpose of clarity I am not changing gear from what I designed upstairs.
On the other hand, to follow up on the crude multiple pregnancy rate (per 1K mothers), there is the adjusted rate per each age group. Looking at the very first histogram we had up stairs, 40-50+ definitely has the highest adjusted rate of multiple pregnancy. It also shows the biggest proportions among all age groups. So my next step is to see if age of mothers and the use of fertility-enhancing drugs would have an effect, accounting for interaction terms as well, on multiple pregnancy, which are kind of illustrated somewhat in the charts above.
That is a good segway for me to bring in logistic regression to further understand if the 2 variables mentioned above are indeed contributing factors in predicting the outcome of multiple pregnancy (or single pregnancy).
# redefine df2
df2 <- df %>% subset(RF_FEDRG %in% c('Y', 'N')) %>%
mutate (fert_treat = RF_FEDRG) %>%
mutate (twins = (DPLURAL >= 2))
# again, let's sample df2
df2d <- df2[sample(nrow(df2),.1*nrow(df2)),]
df2d
# model to look at variables themselvles but no interaction terms
lr.out1 <- glm(twins ~ MAGER + fert_treat, data=df2d, family=binomial(link='logit'))
summary(lr.out1)
##
## Call:
## glm(formula = twins ~ MAGER + fert_treat, family = binomial(link = "logit"),
## data = df2d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8742 -0.7764 -0.7183 -0.6389 1.8770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.397558 0.198907 -1.999 0.04564 *
## MAGER -0.017495 0.005497 -3.182 0.00146 **
## fert_treatY -0.300722 0.058839 -5.111 3.21e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7717.8 on 6938 degrees of freedom
## Residual deviance: 7687.2 on 6936 degrees of freedom
## AIC: 7693.2
##
## Number of Fisher Scoring iterations: 4
# model to include nested random effects
lr.out2 <- glm(twins ~ MAGER / fert_treat, data=df2d, family=binomial(link='logit'))
summary(lr.out2)
##
## Call:
## glm(formula = twins ~ MAGER/fert_treat, family = binomial(link = "logit"),
## data = df2d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8514 -0.7752 -0.7266 -0.6273 1.9044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.539363 0.191722 -2.813 0.0049 **
## MAGER -0.013754 0.005368 -2.562 0.0104 *
## MAGER:fert_treatY -0.008165 0.001712 -4.770 1.84e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7717.8 on 6938 degrees of freedom
## Residual deviance: 7690.5 on 6936 degrees of freedom
## AIC: 7696.5
##
## Number of Fisher Scoring iterations: 4
Interpretation of the results: We can tell from the results of the logistic regression model that, while MAGER is not a stat sig. predictor for the outcome TWINS (multiple pregnancy Y (1) ; 0 here is N), fert_treat (fertility-enhancing drug usage) is a statistically signficant predictor of the outcome of multiple pregnancy or not. The random interaction effect of MAGER and fert_treat turns out to be a stat sig predictor as well. So what tells me is that people that tend to be using the fertility-enhancing drug are of a certain age range, it didn’t say it is the older folks who tend to be using the drug tho’.
Conclusion is obvious here that we can’t conclude that age is definitely a factor in contributing to the outcome of mothers having multiple pregnany (or single pregnancy). But the usage of fertility-enhancing drugs has a strong relationship with the outcome of multiple pregnancy. So if there is an evidence that can be proved in another study to link age (older moms) to the demand for usage for fertility-enhancing drugs, then it will wrap the whole story even better. It is because there are definitely signs of nested random effects of MAGER (the age of mothers) and fert_treat (fertility-enhancing drug usage) to the outcome of multiple pregnancy.