Description

The project aims to explore the relation between expectant mothers’ habits and practices and their children’s birth based on “nc” datasets. In 2004, the state of North Carolina released an extensive data set containing information on all births recorded in their state.

Our main objective of this study is to find whether smoking status of mother during pregnancy time associate with low birth weight of baby or not.

We will investigate the association using essential statistical tools, such as ‘Descriptive Statistics’, ‘Data visualization’ (normality check, distribution of variables, histogram, and Boxplot), and ‘Quantitive Data Analysis’ by linear and nonlinear regression models.

Load the relevant packages

Load the relevant packages: “statsr”, “nc”, “tidyverse” and “ggplot2”.

  • The statsr R package: The package includes data sets, functions and Shiny Applications for learning frequentist and Bayesian statistics with R.

  • The nc R package: for data and custom functions with the statsr resources.

  • The tidyverse “umbrella” package which houses a suite of many different R packages: for data wrangling and data visualization.

  • The ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.

  • The ggpubr package provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots

  • The ggridges package provides two main geoms, geom_ridgeline and geom_density_ridges. The former takes height values directly to draw ridgelines, and the latter first estimates data densities and then draws those using ridgelines.

  • The epitools tools for training and practicing epidemiologists including methods for two-way and multi-way contingency tables.

  • The lmtest A collection of tests, data sets, and examples for diagnostic checking in linear regression models. Furthermore, some generic tools for inference in parametric models are provided.

library(statsr)
library(tidyverse)
library(ggplot2) 
library(ggpubr) 
library(ggridges) 
library(epitools)
library(lmtest)

The following command extends the number of lines of printing your results in the console.

options(scipen = 999) 

North Carolina births record

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. In this dataset we have 1000 observations.

Data Source: State of North Carolina.

  • Load the nc data set into our workspace.

By eyeballing of the output the data set (data frame R name) contain with three columns and eighty two rows 1000 x 13.

data(nc)

We can also check the dimensions of this data frame as well as the names of the variables, type of variables and the first few observations by inserting the name of the data set into the glimpse() function, as seen below:

glimpse(nc)
## Rows: 1,000
## Columns: 13
## $ fage           <int> NA, NA, 19, 21, NA, NA, 18, 17, NA, 20, 30, NA, NA, NA,~
## $ mage           <int> 13, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16,~
## $ mature         <fct> younger mom, younger mom, younger mom, younger mom, you~
## $ weeks          <int> 39, 42, 37, 41, 39, 38, 37, 35, 38, 37, 45, 42, 40, 38,~
## $ premie         <fct> full term, full term, full term, full term, full term, ~
## $ visits         <int> 10, 15, 11, 6, 9, 19, 12, 5, 9, 13, 9, 8, 4, 12, 15, 7,~
## $ marital        <fct> married, married, married, married, married, married, m~
## $ gained         <int> 38, 20, 38, 34, 27, 22, 76, 15, NA, 52, 28, 34, 12, 30,~
## $ weight         <dbl> 7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44, 4.69, 8.81, 6~
## $ lowbirthweight <fct> not low, not low, not low, not low, not low, low, not l~
## $ gender         <fct> male, male, female, male, female, male, male, male, mal~
## $ habit          <fct> nonsmoker, nonsmoker, nonsmoker, nonsmoker, nonsmoker, ~
## $ whitemom       <fct> not white, not white, white, white, not white, not whit~

We have observations of 13 different variables, some categorical and some numerical. The meaning of each variable is as follows:

variable description
fage father’s age in years.
mage mother’s age in years.
mature maturity status of mother.
weeks length of pregnancy in weeks.
premie whether the birth was classified as premature (premie) or full-term.
visits number of hospital visits during pregnancy.
marital whether mother is married or not married at birth.
gained weight gained by mother during pregnancy in pounds.
weight weight of the baby at birth in pounds.
lowbirthweight whether baby was classified as low birthweight (low) or not (not low).
gender gender of the baby, female or male.
habit status of the mother as a nonsmoker or a smoker.
whitemom whether mom is white or not white.

Discriptive Statistics

As a first step in the analysis, we should consider summaries of the data. This can be done using the summary command:

summary(nc)
##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
## 

According to the variables of the summaries, the minimum mother age was 13 years old. Those are signs of teenage pregnancy. We suspected there would be outliers for numerical variables. A component of the data visualization will be able to detect anomalies. There are a few variables that have missing values. In order to perform further analysis, the dataset needs to be cleaned.

Handling missing observation

Before the initial data analysis, we should check missing observations in the dataset. If there have a missing cell, we need to omit it.

sum(is.na(nc))
## [1] 215

The dataset have 215 total missing observations.

nc<-nc[complete.cases(nc),]

# Check again total missing observation

sum(is.na(nc)) # Missing observation is 0
## [1] 0

Now the dataset is full length, we may prceesed for further analysis.

Data visualization

R has some powerful functions for making graphics.

Histograms

Let check the distrution of weight of the baby at birth in pounds by maturity status of mother and status of the mother as a nonsmoker or a smoker.

nc %>%
  ggplot(aes(x=weight, fill=habit, color=habit))+
  geom_histogram(position = "identity", alpha=0.5)+
  labs(title = "Baby weight by mother habit", x="Baby weight at birth (in Pounds)", y="Count")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="bottom")

In the figure above, we can see that non-smokers give birth to babies of greater weight than smokers. It is consistent with our study objective that smoker mothers are associated with give low birth weight babies.

Ridge Plots

Baby weight by maturity status of mother and gender of the baby

Maturity status of mother adds another dimension to our analysis. We would like to identify if there exists any relationship between mature mother with baby weight at birth. We can create Ridge Plots to identify such relationships.

nc %>%
  ggplot(aes(x=weight, y=mature, fill=mature))+
  geom_density_ridges()+
  facet_grid(gender~habit, labeller=labeller(.cols=label_both))+
  labs(title = "Cluster 2", x="Baby weight at birth (in Pounds)", y="Maturity status of mother")+
  scale_color_viridis_c()+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5))

The Ridge Plots above further reaffirm our hypothesis. We can immediately take notice of the fact that counties that saw population loss had no representation of individuals with Bachelors’ degrees in the median household income distribution, while on the flip side, counties that saw population gain had a larger share of the median household income distribution made up of those who had a Bachelors’ degrees (The results are consistent with Cluster_1). Gained-metro and Bachelors’ degrees distribution surprisingly shown tri-models pattern.

The Ridge Plots above further reaffirm our hypothesis. We can immediately take notice of the fact that smoker mon give low weight birth baby as it is shown left skweed (botom-right panel), while on the flip side, non-smoker mom had given large number of baby weight.

To better visualization Boxplot may be one of the another powerful tool.

Boxplots for weight of the baby at birth distribution

Box plots are another great way to visualize the heterogeneity in baby weight at birth.

nc %>%
  ggplot(aes(x=weight, y= habit,
             color=habit))+
  geom_boxplot()+
  labs(title = "A", x="Baby weight at birth (in Pounds)", y="Count")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="right")

As expect, we see that nonsmoker has a lot more variance when it comes to weight. This is in addition to a higher median weight.

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following function to split the weight variable into the habit groups, then take the mean of each using the mean function.

by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.24676
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 6.886429

There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test in the main analysis part.

Statistical model

Test of hypothesis

Before going to building statistical model test of hypothesis one of the importance part of analysis.

H(0): Mean(weight | smoker) = Mean (weight | nonsmoker)

H(A): Mean(weight | smoker) <> Mean (weight | nonsmoker)

Next, we introduce a new function, inference, that we will use for conducting hypothesis tests and constructing confidence intervals.

inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_nonsmoker = 716, y_bar_nonsmoker = 7.2468, s_nonsmoker = 1.4521
## n_smoker = 84, y_bar_smoker = 6.8864, s_smoker = 1.3064
## H0: mu_nonsmoker =  mu_smoker
## HA: mu_nonsmoker != mu_smoker
## t = 2.3625, df = 83
## p_value = 0.0205

Let’s pause for a moment to go through the arguments of this custom function. The first argument is y, which is the response variable that we are interested in: weight. The second argument is the explanatory variable, x, which is the variable that splits the data into two groups, smokers and non-smokers: habit. The third argument, data, is the data frame these variables are stored in. Next is statistic, which is the sample statistic we’re using, or similarly, the population parameter we’re estimating. In future labs we can also work with “median” and “proportion”. Next we decide on the type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci"). When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other. The alternative hypothesis can be "less", "greater", or "twosided". Lastly, the method of inference can be "theoretical" or "simulation" based.

For more information on the inference function see the help file with ?inference.

As the p-value is smaller than the usual confidence level 0.05, we reject the null hypotheses. This means this data set provides enought evidence that the average weights of babies born to smoking and non-smoking mothers are different. We may proceed for further analysis.

Bivariate Analysis

Risk Ratio

You can test for the statistical relationship between two categorical variables using the chisq.test() function. At it’s base, it takes two variables as input.

oddsratio(table(nc$habit, nc$lowbirthweight))
## $data
##            
##             low not low Total
##   nonsmoker  62     654   716
##   smoker     12      72    84
##   Total      74     726   800
## 
## $measure
##            odds ratio with 95% C.I.
##              estimate     lower    upper
##   nonsmoker 1.0000000        NA       NA
##   smoker    0.5640745 0.2986628 1.148773
## 
## $p.value
##            two-sided
##             midp.exact fisher.exact chi.square
##   nonsmoker         NA           NA         NA
##   smoker     0.1100541    0.1088812 0.09221694
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "median-unbiased estimate & mid-p exact CI"

The magnitude of the odds ratio is called the “strength of the association.” The further away an odds ratio is from 1.0, the more likely the causal relationship between the exposure and the outcome. We see from the above bivariate results that the mother’s smoking behavior is associated with a decrease in the baby’s weight at birth. However, the results are not statistically significant.

Linear regression model

To fit a best linear regression model, we may start with a simple linear regression. The code as;

model_1<-lm(weight~habit, data=nc)
summary(model_1)
## 
## Call:
## lm(formula = weight ~ habit, data = nc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2468 -0.6868  0.1736  0.9236  4.3832 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  7.24676    0.05373 134.881 <0.0000000000000002 ***
## habitsmoker -0.36033    0.16580  -2.173              0.0301 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.438 on 798 degrees of freedom
## Multiple R-squared:  0.005884,   Adjusted R-squared:  0.004638 
## F-statistic: 4.723 on 1 and 798 DF,  p-value: 0.03006

The results from the simple linear regression model smoking habit of mothers negatively associated with weight (at birth). These results further reaffirm our primary objective.

According to the previous relevant research, other variables like length of pregnancy, gender of baby, premature status and mother’s age may be potential confounders that should consider in our model (Zheng W et.al 2016; Kharkova OA et.al 2017). Let’s go step by step.

The following step will be to proceed with the `Likelihood Ratio Test (LRT). However, Akaike Information Criterion (AIC) is one of the best model selection criteria we will not apply to our project.

Step-1 : Adjusting mother age

model_1<-lm(weight~habit, data=nc)
model_2<-lm(weight~habit+mage, data=nc) # Add mother age (mage)

lrtest(model_1,model_2)
## Likelihood ratio test
## 
## Model 1: weight ~ habit
## Model 2: weight ~ habit + mage
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   3 -1424.5                     
## 2   4 -1423.8  1 1.5086     0.2193

model_1 is best fit.

Step-2 : Adjusting gender of baby

model_1<-lm(weight~habit, data=nc)
model_3<-lm(weight~habit+gender, data=nc) # Add gender of baby 

lrtest(model_1,model_3)
## Likelihood ratio test
## 
## Model 1: weight ~ habit
## Model 2: weight ~ habit + gender
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   3 -1424.5                         
## 2   4 -1416.5  1 16.119 0.00005947 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_2 is best fit.

Step-3 : Update model_2 as model_3 by length of pregnancy

model_1<-lm(weight~habit, data=nc)
model_3<-lm(weight~habit+gender+weeks, data=nc) ## Add length of pregnancy in weeks.

lrtest(model_1,model_3)
## Likelihood ratio test
## 
## Model 1: weight ~ habit
## Model 2: weight ~ habit + gender + weeks
##   #Df  LogLik Df  Chisq            Pr(>Chisq)    
## 1   3 -1424.5                                    
## 2   5 -1204.9  2 439.25 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_3 is best fit.

Step-3 : Update model_3 as model_4 by premature status of baby

model_1<-lm(weight~habit, data=nc)
model_4<-lm(weight~habit+gender+weeks+premie, data=nc) ## Add premature (premie) or full-term as a categorical variable.

lrtest(model_1,model_4)
## Likelihood ratio test
## 
## Model 1: weight ~ habit
## Model 2: weight ~ habit + gender + weeks + premie
##   #Df  LogLik Df Chisq            Pr(>Chisq)    
## 1   3 -1424.5                                   
## 2   6 -1196.3  3 456.4 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_4 is best fit.

We may consider model_4 as the best model. This was an example of the way to select a linear model. Stepwise regression may another good option for fitting the best model (X Liao et.al 2007).

Let’s see the final model results of the summary,

model_4<-lm(weight~habit+gender+weeks+premie, data=nc)
summary(model_4)
## 
## Call:
## lm(formula = weight ~ habit + gender + weeks + premie, data = nc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3875 -0.6776 -0.0282  0.6646  4.0952 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  -3.25410    0.79449  -4.096          0.000046378 ***
## habitsmoker  -0.30512    0.12513  -2.438                0.015 *  
## gendermale    0.39930    0.07679   5.200          0.000000254 ***
## weeks         0.27006    0.02025  13.335 < 0.0000000000000002 ***
## premiepremie -0.68312    0.16461  -4.150          0.000036844 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.083 on 795 degrees of freedom
## Multiple R-squared:  0.4381, Adjusted R-squared:  0.4353 
## F-statistic: 154.9 on 4 and 795 DF,  p-value: < 0.00000000000000022

The independent effect of smoking intensity on birth weight was estimated by correcting for the potential confounding variables in the adjusted regression model (model_4). Newborn weight decreased as the category of the number of smoker moms increased, with a significant reduction of 0.30 (p<0.05) lower than that of infants born to nonsmoker mothers. Similarly, the weight of baby at birth decreased as the premature increased, with a significant reduction of 0.68 (p<0.01) lower than that of full-term. Whereas newborn weight increased as the length of pregnancy in weeks increased and for male babies respectively 0.27 (p<0.01) and 0.40 (p<0.01).

Limitation of the study

Some limitations of this study needed to be addressed.

First, some confounding factors were not investigated, such as maternal BMI, socio-economic status (SES), and lifestyles might be different in smoking and non-smoking mothers, and these differences may contribute to the differences in the birthweight of their infants.

Second, as we analyzed data from secondary sources, there may be biases related to collecting information using a questionnaire survey, especially when information was collected retrospectively.

Third, the number of cigarettes eaten per day was unavailable in this dataset. The amount of taken cigarette habit is very essential to quantify the association.

Last but not least, due to the unavailability of a continuous exposure variable, we were unable to fit a nonlinear association.

Summary and Conclusion

In conclusion, this study confirmed the increased risk of giving birth to low birthweight infants in North Carolina`s mothers with smoking habits during pregnancy and indicated that premature had a confounding effect on this association, although it is unclear on the basis of the current evidence whether the effect was caused by age or by other potential age-related factors. However, the consistent difference across age groups in various ethnic populations and time periods indicates the need for intervention. No matter what the mechanism is, it is necessary to pay special attention to older mothers with smoking habits when carrying out education programs. Additionally, special perinatal care may be essential for older smoking mothers because of the higher proportion of fetal growth restriction in their infants.

References

  1. Zheng W, Suzuki K, Tanaka T, Kohama M, Yamagata Z; Okinawa Child Health Study Group. Association between Maternal Smoking during Pregnancy and Low Birthweight: Effects by Maternal Age. PLoS One. 2016 Jan 21;11(1):e0146241. doi: 10.1371/journal.pone.0146241. PMID: 26795494; PMCID: PMC4721610.

  2. Kharkova OA, Grjibovski AM, Krettek A, Nieboer E, Odland JØ. Effect of Smoking Behavior before and during Pregnancy on Selected Birth Outcomes among Singleton Full-Term Pregnancy: A Murmansk County Birth Registry Study. Int J Environ Res Public Health. 2017 Aug 2;14(8):867. doi: 10.3390/ijerph14080867. PMID: 28767086; PMCID: PMC5580571.

  3. Liao, X., Li, Q., Yang, X. et al. Multiobjective optimization for crash safety design of vehicles using stepwise regression model. Struct Multidisc Optim 35, 561–569 (2008). https://doi.org/10.1007/s00158-007-0163-x

  4. Hayes C, Kearney M, O’Carroll H, Zgaga L, Geary M, Kelleher C. Patterns of Smoking Behaviour in Low-Income Pregnant Women: A Cohort Study of Differential Effects on Infant Birth Weight. Int J Environ Res Public Health. 2016 Oct 29;13(11):1060. doi: 10.3390/ijerph13111060. PMID: 27801861; PMCID: PMC5129270.

  5. Kataoka, M.C., Carvalheira, A.P.P., Ferrari, A.P. et al. Smoking during pregnancy and harm reduction in birth weight: a cross-sectional study. BMC Pregnancy Childbirth 18, 67 (2018). https://doi.org/10.1186/s12884-018-1694-4