The project aims to explore the relation between expectant mothers’ habits and practices and their children’s birth based on “nc” datasets. In 2004, the state of North Carolina released an extensive data set containing information on all births recorded in their state.
Our main objective of this study is to find whether smoking status of mother during pregnancy time associate with low birth weight of baby or not.
We will investigate the association using essential statistical tools, such as ‘Descriptive Statistics’, ‘Data visualization’ (normality check, distribution of variables, histogram, and Boxplot), and ‘Quantitive Data Analysis’ by linear and nonlinear regression models.
Load the relevant packages: “statsr”, “nc”, “tidyverse” and “ggplot2”.
The statsr R package: The package
includes data sets, functions and Shiny Applications for learning
frequentist and Bayesian statistics with R.
The nc R package: for data and
custom functions with the statsr resources.
The tidyverse “umbrella” package which houses a suite of many different R packages: for data wrangling and data visualization.
The ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.
The ggpubr package provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots
The ggridges package provides two main geoms, geom_ridgeline and geom_density_ridges. The former takes height values directly to draw ridgelines, and the latter first estimates data densities and then draws those using ridgelines.
The epitools tools for training and practicing epidemiologists including methods for two-way and multi-way contingency tables.
The lmtest A collection of tests, data sets, and examples for diagnostic checking in linear regression models. Furthermore, some generic tools for inference in parametric models are provided.
library(statsr)
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(ggridges)
library(epitools)
library(lmtest)
The following command extends the number of lines of printing your results in the console.
options(scipen = 999)
In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. In this dataset we have 1000 observations.
Data Source: State of North Carolina.
By eyeballing of the output the data set (data frame
R name) contain with three columns and eighty two rows
1000 x 13.
data(nc)
We can also check the dimensions of this data frame as well as the
names of the variables, type of variables and the first few observations
by inserting the name of the data set into the glimpse()
function, as seen below:
glimpse(nc)
## Rows: 1,000
## Columns: 13
## $ fage <int> NA, NA, 19, 21, NA, NA, 18, 17, NA, 20, 30, NA, NA, NA,~
## $ mage <int> 13, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,~
## $ mature <fct> younger mom, younger mom, younger mom, younger mom, you~
## $ weeks <int> 39, 42, 37, 41, 39, 38, 37, 35, 38, 37, 45, 42, 40, 38,~
## $ premie <fct> full term, full term, full term, full term, full term, ~
## $ visits <int> 10, 15, 11, 6, 9, 19, 12, 5, 9, 13, 9, 8, 4, 12, 15, 7,~
## $ marital <fct> married, married, married, married, married, married, m~
## $ gained <int> 38, 20, 38, 34, 27, 22, 76, 15, NA, 52, 28, 34, 12, 30,~
## $ weight <dbl> 7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44, 4.69, 8.81, 6~
## $ lowbirthweight <fct> not low, not low, not low, not low, not low, low, not l~
## $ gender <fct> male, male, female, male, female, male, male, male, mal~
## $ habit <fct> nonsmoker, nonsmoker, nonsmoker, nonsmoker, nonsmoker, ~
## $ whitemom <fct> not white, not white, white, white, not white, not whit~
We have observations of 13 different variables, some categorical and some numerical. The meaning of each variable is as follows:
| variable | description |
|---|---|
| fage | father’s age in years. |
| mage | mother’s age in years. |
| mature | maturity status of mother. |
| weeks | length of pregnancy in weeks. |
| premie | whether the birth was classified as premature (premie) or full-term. |
| visits | number of hospital visits during pregnancy. |
| marital | whether mother is married or not married at birth. |
| gained | weight gained by mother during pregnancy in pounds. |
| weight | weight of the baby at birth in pounds. |
| lowbirthweight | whether baby was classified as low birthweight (low) or not (not low). |
| gender | gender of the baby, female or male. |
| habit | status of the mother as a nonsmoker or a smoker. |
| whitemom | whether mom is white or not white. |
As a first step in the analysis, we should consider summaries of the
data. This can be done using the summary command:
summary(nc)
## fage mage mature weeks premie
## Min. :14.00 Min. :13 mature mom :133 Min. :20.00 full term:846
## 1st Qu.:25.00 1st Qu.:22 younger mom:867 1st Qu.:37.00 premie :152
## Median :30.00 Median :27 Median :39.00 NA's : 2
## Mean :30.26 Mean :27 Mean :38.33
## 3rd Qu.:35.00 3rd Qu.:32 3rd Qu.:40.00
## Max. :55.00 Max. :50 Max. :45.00
## NA's :171 NA's :2
## visits marital gained weight
## Min. : 0.0 married :386 Min. : 0.00 Min. : 1.000
## 1st Qu.:10.0 not married:613 1st Qu.:20.00 1st Qu.: 6.380
## Median :12.0 NA's : 1 Median :30.00 Median : 7.310
## Mean :12.1 Mean :30.33 Mean : 7.101
## 3rd Qu.:15.0 3rd Qu.:38.00 3rd Qu.: 8.060
## Max. :30.0 Max. :85.00 Max. :11.750
## NA's :9 NA's :27
## lowbirthweight gender habit whitemom
## low :111 female:503 nonsmoker:873 not white:284
## not low:889 male :497 smoker :126 white :714
## NA's : 1 NA's : 2
##
##
##
##
According to the variables of the summaries, the minimum mother age was 13 years old. Those are signs of teenage pregnancy. We suspected there would be outliers for numerical variables. A component of the data visualization will be able to detect anomalies. There are a few variables that have missing values. In order to perform further analysis, the dataset needs to be cleaned.
Before the initial data analysis, we should check missing observations in the dataset. If there have a missing cell, we need to omit it.
sum(is.na(nc))
## [1] 215
The dataset have 215 total missing observations.
nc<-nc[complete.cases(nc),]
# Check again total missing observation
sum(is.na(nc)) # Missing observation is 0
## [1] 0
Now the dataset is full length, we may prceesed for further analysis.
R has some powerful functions for making graphics.
Let check the distrution of weight of the baby at birth in pounds by maturity status of mother and status of the mother as a nonsmoker or a smoker.
nc %>%
ggplot(aes(x=weight, fill=habit, color=habit))+
geom_histogram(position = "identity", alpha=0.5)+
labs(title = "Baby weight by mother habit", x="Baby weight at birth (in Pounds)", y="Count")+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position="bottom")
In the figure above, we can see that non-smokers give birth to babies of greater weight than smokers. It is consistent with our study objective that smoker mothers are associated with give low birth weight babies.
Baby weight by maturity status of mother and gender of the baby
Maturity status of mother adds another dimension to our analysis. We would like to identify if there exists any relationship between mature mother with baby weight at birth. We can create Ridge Plots to identify such relationships.
nc %>%
ggplot(aes(x=weight, y=mature, fill=mature))+
geom_density_ridges()+
facet_grid(gender~habit, labeller=labeller(.cols=label_both))+
labs(title = "Cluster 2", x="Baby weight at birth (in Pounds)", y="Maturity status of mother")+
scale_color_viridis_c()+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))
The Ridge Plots above further reaffirm our hypothesis. We can immediately take notice of the fact that counties that saw population loss had no representation of individuals with Bachelors’ degrees in the median household income distribution, while on the flip side, counties that saw population gain had a larger share of the median household income distribution made up of those who had a Bachelors’ degrees (The results are consistent with Cluster_1). Gained-metro and Bachelors’ degrees distribution surprisingly shown tri-models pattern.
The Ridge Plots above further reaffirm our hypothesis. We can immediately take notice of the fact that smoker mon give low weight birth baby as it is shown left skweed (botom-right panel), while on the flip side, non-smoker mom had given large number of baby weight.
To better visualization Boxplot may be one of the another powerful tool.
Box plots are another great way to visualize the heterogeneity in baby weight at birth.
nc %>%
ggplot(aes(x=weight, y= habit,
color=habit))+
geom_boxplot()+
labs(title = "A", x="Baby weight at birth (in Pounds)", y="Count")+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position="right")
As expect, we see that nonsmoker has a lot more variance when it comes to weight. This is in addition to a higher median weight.
The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following function to split the weight variable into the habit groups, then take the mean of each using the mean function.
by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.24676
## ------------------------------------------------------------
## nc$habit: smoker
## [1] 6.886429
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test in the main analysis part.
Before going to building statistical model test of hypothesis one of the importance part of analysis.
H(0): Mean(weight | smoker) = Mean (weight | nonsmoker)
H(A): Mean(weight | smoker) <> Mean (weight | nonsmoker)
Next, we introduce a new function, inference, that we
will use for conducting hypothesis tests and constructing confidence
intervals.
inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_nonsmoker = 716, y_bar_nonsmoker = 7.2468, s_nonsmoker = 1.4521
## n_smoker = 84, y_bar_smoker = 6.8864, s_smoker = 1.3064
## H0: mu_nonsmoker = mu_smoker
## HA: mu_nonsmoker != mu_smoker
## t = 2.3625, df = 83
## p_value = 0.0205
Let’s pause for a moment to go through the arguments of this custom
function. The first argument is y, which is the response
variable that we are interested in: weight. The second
argument is the explanatory variable, x, which is the
variable that splits the data into two groups, smokers and non-smokers:
habit. The third argument, data, is the data
frame these variables are stored in. Next is statistic,
which is the sample statistic we’re using, or similarly, the population
parameter we’re estimating. In future labs we can also work with
“median” and “proportion”. Next we decide on the type of
inference we want: a hypothesis test ("ht") or a confidence
interval ("ci"). When performing a hypothesis test, we also
need to supply the null value, which in this case is
0, since the null hypothesis sets the two population means
equal to each other. The alternative hypothesis can be
"less", "greater", or "twosided".
Lastly, the method of inference can be
"theoretical" or "simulation" based.
For more information on the inference function see the help file with ?inference.
As the p-value is smaller than the usual confidence level 0.05, we reject the null hypotheses. This means this data set provides enought evidence that the average weights of babies born to smoking and non-smoking mothers are different. We may proceed for further analysis.
You can test for the statistical relationship between two categorical
variables using the chisq.test() function. At it’s base, it
takes two variables as input.
oddsratio(table(nc$habit, nc$lowbirthweight))
## $data
##
## low not low Total
## nonsmoker 62 654 716
## smoker 12 72 84
## Total 74 726 800
##
## $measure
## odds ratio with 95% C.I.
## estimate lower upper
## nonsmoker 1.0000000 NA NA
## smoker 0.5640745 0.2986628 1.148773
##
## $p.value
## two-sided
## midp.exact fisher.exact chi.square
## nonsmoker NA NA NA
## smoker 0.1100541 0.1088812 0.09221694
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "median-unbiased estimate & mid-p exact CI"
The magnitude of the odds ratio is called the “strength of the association.” The further away an odds ratio is from 1.0, the more likely the causal relationship between the exposure and the outcome. We see from the above bivariate results that the mother’s smoking behavior is associated with a decrease in the baby’s weight at birth. However, the results are not statistically significant.
To fit a best linear regression model, we may start with a simple linear regression. The code as;
model_1<-lm(weight~habit, data=nc)
summary(model_1)
##
## Call:
## lm(formula = weight ~ habit, data = nc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2468 -0.6868 0.1736 0.9236 4.3832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.24676 0.05373 134.881 <0.0000000000000002 ***
## habitsmoker -0.36033 0.16580 -2.173 0.0301 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.438 on 798 degrees of freedom
## Multiple R-squared: 0.005884, Adjusted R-squared: 0.004638
## F-statistic: 4.723 on 1 and 798 DF, p-value: 0.03006
The results from the simple linear regression model smoking habit of mothers negatively associated with weight (at birth). These results further reaffirm our primary objective.
According to the previous relevant research, other variables like length of pregnancy, gender of baby, premature status and mother’s age may be potential confounders that should consider in our model (Zheng W et.al 2016; Kharkova OA et.al 2017). Let’s go step by step.
The following step will be to proceed with the `Likelihood Ratio Test (LRT). However, Akaike Information Criterion (AIC) is one of the best model selection criteria we will not apply to our project.
Step-1 : Adjusting mother age
model_1<-lm(weight~habit, data=nc)
model_2<-lm(weight~habit+mage, data=nc) # Add mother age (mage)
lrtest(model_1,model_2)
## Likelihood ratio test
##
## Model 1: weight ~ habit
## Model 2: weight ~ habit + mage
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -1424.5
## 2 4 -1423.8 1 1.5086 0.2193
model_1 is best fit.
Step-2 : Adjusting gender of baby
model_1<-lm(weight~habit, data=nc)
model_3<-lm(weight~habit+gender, data=nc) # Add gender of baby
lrtest(model_1,model_3)
## Likelihood ratio test
##
## Model 1: weight ~ habit
## Model 2: weight ~ habit + gender
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -1424.5
## 2 4 -1416.5 1 16.119 0.00005947 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_2 is best fit.
Step-3 : Update model_2 as model_3 by length of pregnancy
model_1<-lm(weight~habit, data=nc)
model_3<-lm(weight~habit+gender+weeks, data=nc) ## Add length of pregnancy in weeks.
lrtest(model_1,model_3)
## Likelihood ratio test
##
## Model 1: weight ~ habit
## Model 2: weight ~ habit + gender + weeks
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -1424.5
## 2 5 -1204.9 2 439.25 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_3 is best fit.
Step-3 : Update model_3 as model_4 by premature status of baby
model_1<-lm(weight~habit, data=nc)
model_4<-lm(weight~habit+gender+weeks+premie, data=nc) ## Add premature (premie) or full-term as a categorical variable.
lrtest(model_1,model_4)
## Likelihood ratio test
##
## Model 1: weight ~ habit
## Model 2: weight ~ habit + gender + weeks + premie
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -1424.5
## 2 6 -1196.3 3 456.4 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_4 is best fit.
We may consider model_4 as the best model. This was an example of the way to select a linear model. Stepwise regression may another good option for fitting the best model (X Liao et.al 2007).
Let’s see the final model results of the summary,
model_4<-lm(weight~habit+gender+weeks+premie, data=nc)
summary(model_4)
##
## Call:
## lm(formula = weight ~ habit + gender + weeks + premie, data = nc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3875 -0.6776 -0.0282 0.6646 4.0952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.25410 0.79449 -4.096 0.000046378 ***
## habitsmoker -0.30512 0.12513 -2.438 0.015 *
## gendermale 0.39930 0.07679 5.200 0.000000254 ***
## weeks 0.27006 0.02025 13.335 < 0.0000000000000002 ***
## premiepremie -0.68312 0.16461 -4.150 0.000036844 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.083 on 795 degrees of freedom
## Multiple R-squared: 0.4381, Adjusted R-squared: 0.4353
## F-statistic: 154.9 on 4 and 795 DF, p-value: < 0.00000000000000022
The independent effect of smoking intensity on birth weight was estimated by correcting for the potential confounding variables in the adjusted regression model (model_4). Newborn weight decreased as the category of the number of smoker moms increased, with a significant reduction of 0.30 (p<0.05) lower than that of infants born to nonsmoker mothers. Similarly, the weight of baby at birth decreased as the premature increased, with a significant reduction of 0.68 (p<0.01) lower than that of full-term. Whereas newborn weight increased as the length of pregnancy in weeks increased and for male babies respectively 0.27 (p<0.01) and 0.40 (p<0.01).
Some limitations of this study needed to be addressed.
First, some confounding factors were not investigated, such as maternal BMI, socio-economic status (SES), and lifestyles might be different in smoking and non-smoking mothers, and these differences may contribute to the differences in the birthweight of their infants.
Second, as we analyzed data from secondary sources, there may be biases related to collecting information using a questionnaire survey, especially when information was collected retrospectively.
Third, the number of cigarettes eaten per day was unavailable in this dataset. The amount of taken cigarette habit is very essential to quantify the association.
Last but not least, due to the unavailability of a continuous exposure variable, we were unable to fit a nonlinear association.
In conclusion, this study confirmed the increased risk of giving birth to low birthweight infants in North Carolina`s mothers with smoking habits during pregnancy and indicated that premature had a confounding effect on this association, although it is unclear on the basis of the current evidence whether the effect was caused by age or by other potential age-related factors. However, the consistent difference across age groups in various ethnic populations and time periods indicates the need for intervention. No matter what the mechanism is, it is necessary to pay special attention to older mothers with smoking habits when carrying out education programs. Additionally, special perinatal care may be essential for older smoking mothers because of the higher proportion of fetal growth restriction in their infants.
Zheng W, Suzuki K, Tanaka T, Kohama M, Yamagata Z; Okinawa Child Health Study Group. Association between Maternal Smoking during Pregnancy and Low Birthweight: Effects by Maternal Age. PLoS One. 2016 Jan 21;11(1):e0146241. doi: 10.1371/journal.pone.0146241. PMID: 26795494; PMCID: PMC4721610.
Kharkova OA, Grjibovski AM, Krettek A, Nieboer E, Odland JØ. Effect of Smoking Behavior before and during Pregnancy on Selected Birth Outcomes among Singleton Full-Term Pregnancy: A Murmansk County Birth Registry Study. Int J Environ Res Public Health. 2017 Aug 2;14(8):867. doi: 10.3390/ijerph14080867. PMID: 28767086; PMCID: PMC5580571.
Liao, X., Li, Q., Yang, X. et al. Multiobjective optimization for crash safety design of vehicles using stepwise regression model. Struct Multidisc Optim 35, 561–569 (2008). https://doi.org/10.1007/s00158-007-0163-x
Hayes C, Kearney M, O’Carroll H, Zgaga L, Geary M, Kelleher C. Patterns of Smoking Behaviour in Low-Income Pregnant Women: A Cohort Study of Differential Effects on Infant Birth Weight. Int J Environ Res Public Health. 2016 Oct 29;13(11):1060. doi: 10.3390/ijerph13111060. PMID: 27801861; PMCID: PMC5129270.
Kataoka, M.C., Carvalheira, A.P.P., Ferrari, A.P. et al. Smoking during pregnancy and harm reduction in birth weight: a cross-sectional study. BMC Pregnancy Childbirth 18, 67 (2018). https://doi.org/10.1186/s12884-018-1694-4