Overview

This project investigates how the mother’s age, the father’s age, and the mother’s smoking habit affect the health of a baby, specifically the baby’s birthweight and whether the baby was born prematurely.

Introduction

I will be using the “ncbirths” dataset from the openintro package. It contains 1,000 randomly sampled births from the birth records released by the state of North Carolina in 2004. This dataset has been of interest to medical researchers who study the relationship between the habits and practices of expectant mothers and the birth of their children. The goal is to analyze whether the variables recorded in ncbirths are related to each other and, if so, how strong those relationships are.

Exploring the Data

This project will look at the ncbirths dataset, which contains information on 1,000 birth cases recorded in the state of North Carolina. It was released to the public in 2004.

#Storing the dataset into the environment
library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
ncbirths <- ncbirths
# View the structure of dataset
str(ncbirths)
## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...

The dataset contains 1,000 randomly sampled births with 13 recorded variables. This analysis focuses on five (5) of them: the mother’s age and the father’s age (in years), the mother’s smoking habit, the length of pregnancy (in weeks), and the birthweight of the baby (in pounds).

# Preview of first few lines of data
head(ncbirths)
##   fage mage      mature weeks    premie visits marital gained weight
## 1   NA   13 younger mom    39 full term     10 married     38   7.63
## 2   NA   14 younger mom    42 full term     15 married     20   7.88
## 3   19   15 younger mom    37 full term     11 married     38   6.63
## 4   21   15 younger mom    41 full term      6 married     34   8.00
## 5   NA   15 younger mom    39 full term      9 married     27   6.38
## 6   NA   15 younger mom    38 full term     19 married     22   5.38
##   lowbirthweight gender     habit  whitemom
## 1        not low   male nonsmoker not white
## 2        not low   male nonsmoker not white
## 3        not low female nonsmoker     white
## 4        not low   male nonsmoker     white
## 5        not low female nonsmoker not white
## 6            low   male nonsmoker not white
summary(ncbirths)
##       fage            mage            mature        weeks      
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00  
##  Median :30.00   Median :27                     Median :39.00  
##  Mean   :30.26   Mean   :27                     Mean   :38.33  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00  
##  Max.   :55.00   Max.   :50                     Max.   :45.00  
##  NA's   :171                                    NA's   :2      
##        premie        visits            marital        gained     
##  full term:846   Min.   : 0.0   married    :386   Min.   : 0.00  
##  premie   :152   1st Qu.:10.0   not married:613   1st Qu.:20.00  
##  NA's     :  2   Median :12.0   NA's       :  1   Median :30.00  
##                  Mean   :12.1                     Mean   :30.33  
##                  3rd Qu.:15.0                     3rd Qu.:38.00  
##                  Max.   :30.0                     Max.   :85.00  
##                  NA's   :9                        NA's   :27     
##      weight       lowbirthweight    gender          habit    
##  Min.   : 1.000   low    :111    female:503   nonsmoker:873  
##  1st Qu.: 6.380   not low:889    male  :497   smoker   :126  
##  Median : 7.310                               NA's     :  1  
##  Mean   : 7.101                                              
##  3rd Qu.: 8.060                                              
##  Max.   :11.750                                              
##                                                              
##       whitemom  
##  not white:284  
##  white    :714  
##  NA's     :  2  
##                 
##                 
##                 
## 

According to the summary, the father’s age ranges from 14 to 55 years, the mother’s age ranges from 13 to 50 years, the pregnancy length ranges from 20 to 45 weeks, and the weight of the baby ranges from 1 to 11.75 pounds.

#Seeing how many mothers were smokers and nonsmokers
table(ncbirths$habit)
## 
## nonsmoker    smoker 
##       873       126
# Observing the summary and sample size of the mother's age variable
summary(ncbirths$mage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      13      22      27      27      32      50
table(is.na(ncbirths$mage))
## 
## FALSE 
##  1000
# Observing the summary and sample size of the father's age variable
summary(ncbirths$fage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   25.00   30.00   30.26   35.00   55.00     171
table(is.na(ncbirths$fage))
## 
## FALSE  TRUE 
##   829   171
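
The missing-value counts above matter later because functions such as lm() drop incomplete rows by default. As a minimal sketch, four columns of the six rows printed by head(ncbirths) can be retyped into a small data frame to count missing values per column and the number of complete rows. The name first_rows is made up for this illustration.

```r
# First six rows of four columns, retyped from the head(ncbirths) output above.
# `first_rows` is a hypothetical name used only for this illustration.
first_rows <- data.frame(
  fage   = c(NA, NA, 19, 21, NA, NA),
  mage   = c(13, 14, 15, 15, 15, 15),
  weeks  = c(39, 42, 37, 41, 39, 38),
  weight = c(7.63, 7.88, 6.63, 8.00, 6.38, 5.38)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(first_rows))
print(na_per_column)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(first_rows))
print(n_complete)
```

The same two calls applied to the full ncbirths data frame should show how many births remain once rows with a missing value are excluded.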

Plotting pregnancy length in relation to father’s age, mother’s age, and mother’s smoking habits

To see whether pregnancy length is related to each explanatory variable, we should look at a scatterplot of the data (or a box plot for the categorical smoking habit).

# Creating a scatterplot for the variable mother's age as explanatory variable(X) and length of pregnancy as response variable(Y)
plot(weeks ~ mage, data = ncbirths, main="Mother's Age and Pregnancy Length", xlab="Mother's Age", ylab = "Pregnancy Length in Weeks")

# Creating a scatterplot for the variable father's age as explanatory variable(X) and length of pregnancy as response variable(Y)
plot(weeks ~ fage, data = ncbirths, main="Father's Age and Pregnancy Length", xlab="Father's Age", ylab = "Pregnancy Length in Weeks")

#Creating a box plot for the variables mother's smoking habit and pregnancy length
plot(weeks ~ habit, data = ncbirths, main="Mother's Smoking Habit and Pregnancy Length", xlab="Smoking Habit of Mother", ylab = "Pregnancy Length in Weeks")

Correlation Coefficients for the set of variables

#Finding the correlation coefficients for the sets of variables
cor(ncbirths$mage, ncbirths$weeks, use = "pairwise.complete.obs")
## [1] -0.03208743
cor(ncbirths$fage, ncbirths$weeks, use = "pairwise.complete.obs")
## [1] -0.01607193

Looking at the scatterplots for father’s age and pregnancy length as well as for mother’s age and pregnancy length, neither pair of variables shows a visibly strong correlation, and the correlation coefficients (about -0.02 and -0.03) confirm this. In the box plot of pregnancy length for smoking and nonsmoking mothers, the median pregnancy length appears to be about the same for both groups. Although the nonsmoking mothers show a wider range of pregnancy lengths, this could be due to the larger sample size of nonsmokers.

Plotting the birthweight of a baby in relation to father’s age, mother’s age, and mother’s smoking habits

# Creating a scatterplot for mother's age and babyweight
plot(weight ~ mage, data = ncbirths, main = "Mother's Age and Baby's Birthweight", xlab="Mother's Age", ylab="Baby's Birthweight")

The scatterplot of mother’s age and baby’s birthweight shows only a weak linear relationship between the mother’s age and the birthweight.

#Creating a scatterplot for father's age and babyweight
plot(weight ~ fage , data = ncbirths, main = "Father's Age and Baby's Birthweight", xlab="Father's Age", ylab="Baby's Birthweight")

The scatterplot of father’s age and baby’s birthweight shows only a weak linear relationship between the father’s age and the birthweight.

# Creating a side-by-side boxplot of habit and weight
boxplot(weight ~ habit,data=ncbirths, main="Relation Between Mother's Smoking Habit and Baby's Weight", 
    ylab="Baby's Weight", xlab="Mother Smoker/Non-Smoker")

Interestingly, the nonsmoker group shows much more spread in birthweight, in addition to a higher median weight.

The box plot of mother’s smoking habit and baby’s birthweight shows that the median birthweight for babies of smokers is only slightly below that of nonsmoking mothers’ children. Although the nonsmoking mothers have a wider range of birthweights for their children, with more outliers, this could be due to the larger sample size of nonsmokers.

Correlation Coefficients for the set of variables

#Finding the correlation coefficients for the sets of variables
cor(ncbirths$mage, ncbirths$weight, use = "pairwise.complete.obs")
## [1] 0.05506589
cor(ncbirths$fage, ncbirths$weight, use = "pairwise.complete.obs")
## [1] 0.07023403
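
One way to judge how weak these linear relationships are is to square each correlation coefficient, which gives the share of variance in one variable that a straight-line fit on the other would account for. The sketch below simply reuses the four coefficients printed in this report. The vector names r and pct_variance are assumptions made for illustration.

```r
# Correlation coefficients copied from the cor() output in this report.
# The vector name `r` is chosen here for illustration only.
r <- c(
  mage_weeks  = -0.03208743,
  fage_weeks  = -0.01607193,
  mage_weight =  0.05506589,
  fage_weight =  0.07023403
)

# Squared correlation, expressed as a percentage of variance
pct_variance <- 100 * r^2
print(round(pct_variance, 2))
```

Each of the four values comes out below one percent, which is consistent with the scattered appearance of the plots.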

Analysis

Multiple Regression - Length of Pregnancy on Mother’s Age and Father’s Age

# Creating a linear model for the variables of mother's age and father's age
lm_weeks.ages <- lm(weeks ~ mage + fage, data=ncbirths)
lm_weeks.ages
## 
## Call:
## lm(formula = weeks ~ mage + fage, data = ncbirths)
## 
## Coefficients:
## (Intercept)         mage         fage  
##   38.747494    -0.022717     0.009309

The regression equation for this model is: \(\hat{y} = 38.747494 - 0.022717 x_1 + 0.009309 x_2\), where \(x_1\) is the mother’s age and \(x_2\) is the father’s age.
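
As a rough illustration of how the fitted equation is used, the sketch below plugs a pair of ages into the coefficients printed above. The function name predict_weeks and the example ages (27 and 30, the median ages in the summary) are assumptions made for this example, not part of the original analysis.

```r
# Coefficients copied from the lm() output above
b0     <- 38.747494   # intercept
b_mage <- -0.022717   # slope for mother's age
b_fage <-  0.009309   # slope for father's age

# Hypothetical helper: predicted pregnancy length (weeks) for given ages
predict_weeks <- function(mage, fage) {
  b0 + b_mage * mage + b_fage * fage
}

# Example: 27-year-old mother, 30-year-old father
pred <- predict_weeks(mage = 27, fage = 30)
print(round(pred, 2))
```

With the fitted model object available, predict(lm_weeks.ages, newdata = data.frame(mage = 27, fage = 30)) should give the same value up to rounding of the coefficients.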

# Finding the R-squared value with the summary function
summary(lm_weeks.ages)$r.squared
## [1] 0.001178112

The multiple R-squared value of 0.001178 indicates that approximately 0.12% of the variance in pregnancy length is explained by the mother’s age and the father’s age. This implies that these variables have little to no linear relationship with the duration of a pregnancy.
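
For reference, the R-squared value reported by summary() follows the usual definition for a linear model with an intercept. It is written here with \(y_i\) for the observed pregnancy lengths, \(\hat{y}_i\) for the fitted values, and \(\bar{y}\) for the mean of the observed values. These symbols are introduced here for illustration.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A value close to 0 means the fitted values account for almost none of the variation around the mean.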

Multiple Regression - Birthweight on Mother’s Age and Father’s Age

# Creating a linear model for the variables of mother's age and father's age
lm_weight.ages <- lm(weight ~ mage + fage, data=ncbirths)
lm_weight.ages
## 
## Call:
## lm(formula = weight ~ mage + fage, data = ncbirths)
## 
## Coefficients:
## (Intercept)         mage         fage  
##    6.686128     0.005144     0.011728

The regression equation for this model is: \(\hat{y} = 6.686128 + 0.005144 x_1 + 0.011728 x_2\), where \(x_1\) is the mother’s age and \(x_2\) is the father’s age.

# Finding the R-squared value with the summary function
summary(lm_weight.ages)$r.squared
## [1] 0.005112105

The multiple R-squared value of 0.005112 indicates that about 0.51% of the variance in birthweight is explained by the mother’s age and the father’s age. This implies that these variables have little to no linear relationship with the birthweight of a baby.

P-value of getting sample statistics by chance

# Smokers subset
ncsmokers <- subset(ncbirths, habit == "smoker")

# Nonsmokers subset
ncnonsmokers <- subset(ncbirths, habit == "nonsmoker")

# Mean difference in weights (x1 - x2)
mean(ncsmokers$weight) - mean(ncnonsmokers$weight)
## [1] -0.3155425
summary(ncsmokers$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.690   6.077   7.060   6.829   7.735   9.190
summary(ncnonsmokers$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.440   7.310   7.144   8.060  11.750

These summaries highlight the association between the two variables. Babies born to nonsmokers have, on average, higher weights than those born to smokers (the median weight is also lower for smokers), so smoking habit appears to be associated with birthweight.

Checking if the conditions necessary for inference are satisfied

# Sample sizes to check the conditions
by(ncbirths$weight, ncbirths$habit, length)
## ncbirths$habit: nonsmoker
## [1] 873
## -------------------------------------------------------- 
## ncbirths$habit: smoker
## [1] 126

Both groups contain far more than 30 observations and the births were randomly sampled, so the conditions for inference on a difference of two means are reasonably satisfied.

The hypotheses for testing whether the average weights of babies born to smoking and nonsmoking mothers are different:

\(H_0: \mu_{smoking} = \mu_{nonsmoking}\) There is no difference between the mean birthweights of babies born to smoking and nonsmoking mothers.

\(H_A: \mu_{smoking} \neq \mu_{nonsmoking}\) There is a difference between the two means.

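# store standard error
n1 <- nrow(ncsmokers)     # 126 smokers
n2 <- nrow(ncnonsmokers)  # 873 nonsmokers
se <- sqrt((sd(ncsmokers$weight)^2/n1) + (sd(ncnonsmokers$weight)^2/n2))

# test statistic
ts <- (mean(ncsmokers$weight) - mean(ncnonsmokers$weight))/se

# two-sided p-value, using the conservative df = min(n1, n2) - 1 = 125
2 * pt(-abs(ts), df = 125)
## [1] 0.01987363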
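
The hand calculation above is the unpooled (Welch) two-sample t statistic, which base R’s t.test() also computes. The sketch below checks that equivalence on simulated data only. The group sizes and means are taken from this report, but the standard deviation of 1.5 and the seed are assumptions, so the numbers it prints are not results for ncbirths.

```r
# Simulated stand-in data; NOT the ncbirths values.
# Group sizes and means come from this report; sd = 1.5 is an assumption.
set.seed(1)
smoker_w    <- rnorm(126, mean = 6.83, sd = 1.5)
nonsmoker_w <- rnorm(873, mean = 7.14, sd = 1.5)

# Manual unpooled standard error and test statistic, as in the report
se_sim <- sqrt(var(smoker_w) / length(smoker_w) +
               var(nonsmoker_w) / length(nonsmoker_w))
t_manual <- (mean(smoker_w) - mean(nonsmoker_w)) / se_sim

# Welch two-sample t-test (var.equal = FALSE is the default)
welch <- t.test(smoker_w, nonsmoker_w)

# The two statistics agree; t.test() uses the Welch-Satterthwaite df,
# which is at least as large as the conservative min(n1, n2) - 1 = 125
print(c(manual = t_manual, t_test = unname(welch$statistic)))
print(unname(welch$parameter))
```

On the real data, t.test(weight ~ habit, data = ncbirths) would run the same kind of test directly from the data frame.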

Decide whether to reject H0 or fail to reject H0 at a significance level of 0.05 based on the p-value.
The p-value is less than alpha = 0.05, therefore we reject H0.

Conclusions

Based on the data, the mother’s age, the father’s age, and the mother’s smoking habit are all poor predictors of a baby’s health with respect to birthweight and whether the baby is likely to be born prematurely. However, the p-value from the hypothesis test shows that babies born to smoking mothers do not have the same average birthweight as babies born to nonsmoking mothers.

Limitations

The most significant limitation of this data is the lack of variables focusing on the health of the baby. The only information present shows whether the baby had a healthy birthweight and whether the pregnancy lasted long enough for the baby not to be born prematurely. There were no variables on the health of the baby outside of these two factors, such as birth defects or whether a health condition was present. If there had been other variables describing the baby’s condition, the analysis might have found that a parent’s age or being carried by a smoking mother has an impact on the overall health of the baby. Because no such variables are present, parents’ age and smoking appear to have little impact here, even though in reality they may affect other aspects of the baby’s health.

Another limitation is that this data does not express any information about miscarriages. There is no way to tell how many other parents were going to have a child but lost their baby, potentially due to one of the variables of age or smoking habit. If parents who lost their child had been included in the sample data, more conclusions may have been drawn about the impacts of these variables.


This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.