Overview

This project aims to research how the mother’s age, father’s age, and the mother’s smoking habits affect the health of a baby, specifically the baby’s birthweight and whether or not it was born prematurely.

Introduction

This project will look at the ncbirths dataset, which contains information on 1,000 birth cases recorded in the state of North Carolina. It was released to the public in 2004.

#Storing the dataset into the environment
ncbirths <- openintro::ncbirths

Part 1: Exploratory Analysis

#Seeing how many mothers were smokers and nonsmokers
table(ncbirths$habit)
## 
## nonsmoker    smoker 
##       873       126

#Observing the summary and sample size of the father's age variable
summary(ncbirths$fage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   25.00   30.00   30.26   35.00   55.00     171
table(is.na(ncbirths$fage))
## 
## FALSE  TRUE 
##   829   171

#Observing the summary and sample size of the mother's age variable
summary(ncbirths$mage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      13      22      27      27      32      50
table(is.na(ncbirths$mage))
## 
## FALSE 
##  1000

Plotting pregnancy length against father’s age, mother’s age, and mother’s smoking habits

#Creating a scatterplot
plot(weeks ~ fage, data = ncbirths, main="Father's Age and Pregnancy Length", xlab="Father's Age", ylab = "Pregnancy Length in Weeks")

#Creating a scatterplot
plot(weeks ~ mage, data = ncbirths, main="Mother's Age and Pregnancy Length", xlab="Mother's Age", ylab = "Pregnancy Length in Weeks")

#Creating a box plot
plot(weeks ~ habit, data = ncbirths, main="Mother's Smoking Habit and Pregnancy Length", xlab="Smoking Habit of Mother", ylab = "Pregnancy Length in Weeks")

Looking at the scatterplots for father’s age and pregnancy length as well as for mother’s age and pregnancy length, neither pair of variables shows a visibly strong correlation. In the box plot of pregnancy length for smoking and nonsmoking mothers, the median pregnancy length is about the same for both groups. Although the nonsmoking mothers show a wider range of pregnancy lengths, this could be due to the larger sample size of nonsmokers.

Plotting the birthweight of a baby against father’s age, mother’s age, and mother’s smoking habits

#Creating a scatterplot
plot(weight ~ fage, data = ncbirths, main = "Father's Age and Baby's Birthweight", xlab="Father's Age", ylab="Baby's Birthweight")

#Creating a scatterplot
plot(weight ~ mage, data = ncbirths, main = "Mother's Age and Baby's Birthweight", xlab="Mother's Age", ylab="Baby's Birthweight")

#Creating a box plot
plot(weight ~ habit, data = ncbirths, main = "Mother's Smoking Habit and Baby's Birthweight", xlab="Mother's Smoking Habit", ylab="Baby's Birthweight")

Looking at the scatterplots for father’s age and baby’s birthweight as well as for mother’s age and baby’s birthweight, there does not appear to be a visibly strong correlation between these sets of variables.

The box plot of mother’s smoking habit and baby’s birthweight shows that the median birthweight for babies of smokers is only slightly below that of nonsmoking mothers’ children. Although the nonsmoking mothers have a wider range of birthweights for their children with more outliers, this could be due to the larger sample size of nonsmokers.

Part 2: Analyzing the impact of variables on pregnancy length

Correlation Coefficients

#Finding the correlation coefficients for the sets of variables
cor(ncbirths$fage, ncbirths$weeks, use = "pairwise.complete.obs")
## [1] -0.01607193
cor(ncbirths$mage, ncbirths$weeks, use = "pairwise.complete.obs")
## [1] -0.03208743

Based on the correlation coefficients of approximately -0.016 (father’s age/pregnancy length) and -0.032 (mother’s age/pregnancy length), there appears to be essentially no linear correlation between either parent’s age and the length of a pregnancy.

Proportions and Hypothesis Test

#Creating a table to show how many babies were and were not born prematurely for smoking and nonsmoking mothers
table(ncbirths$habit, ncbirths$premie)
##            
##             full term premie
##   nonsmoker       739    133
##   smoker          107     19
#Creating a proportion of how many babies were born prematurely for nonsmoking mothers
133/(133+739)
## [1] 0.1525229
#Creating a proportion of how many babies were born prematurely for smoking mothers
19/(107+19)
## [1] 0.1507937

By looking at the proportions of nonsmoking and smoking mothers that gave birth to premature babies, it can be seen that approximately 15.25% of nonsmoking mothers gave birth prematurely, and approximately 15.08% of smoking mothers gave birth prematurely.
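The two proportions above can also be computed in one step rather than by typing the counts into separate divisions. The following is a minimal sketch, assuming the counts are re-entered by hand from the table above; the names premie_counts and premie_props are illustrative and not part of the original analysis.

```r
# Counts copied from the habit-by-premie table above.
# With the dataset loaded, table(ncbirths$habit, ncbirths$premie)
# could be passed to prop.table() directly instead.
premie_counts <- matrix(
  c(739, 133,
    107,  19),
  nrow = 2, byrow = TRUE,
  dimnames = list(habit  = c("nonsmoker", "smoker"),
                  premie = c("full term", "premie"))
)

# margin = 1 gives row-wise proportions, so each habit group sums to 1
premie_props <- prop.table(premie_counts, margin = 1)
round(premie_props, 4)
```

The "premie" column of premie_props holds the same two proportions that were calculated by hand.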

A hypothesis test can be used to test whether the two proportions truly differ, or whether they only appear different due to sampling variation.

Hypothesis:

\(H_0: p_1 = p_2\)

\(H_A: p_1 \neq p_2\)
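For reference, the calculation that follows is a pooled two-proportion z-test, which can be written out as below. The symbols \(\hat{p}_{pool}\), \(SE\), and \(z\) are notation introduced here for illustration; they correspond to the R objects ppool.preg, se.preg, and the printed test statistic.

\(\hat{p}_{pool} = \frac{x_1 + x_2}{n_1 + n_2}\)

\(SE = \sqrt{\frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_1} + \frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_2}}\)

\(z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE}\)

Here \(x_1\) and \(x_2\) are the numbers of premature births in each group, \(n_1\) and \(n_2\) are the group sizes, and \(\hat{p}_1 = x_1/n_1\), \(\hat{p}_2 = x_2/n_2\) are the sample proportions.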

#Creating the pooled proportion
x1 <- 133
x2 <- 19
n1 <- 133 + 739
n2 <- 19 + 107
  
ppool.preg <- (x1 + x2) / (n1 + n2)
q <- (1 - ppool.preg)
  
#Finding the standard error
se.preg <- sqrt(((ppool.preg*q)/n1)+((ppool.preg*q)/n2))

#Finding the test statistic
p1 <- x1/n1
p2 <- x2/n2
  
((p1-p2)-0)/se.preg
## [1] 0.05049733

#Finding the p-value
pnorm(0.05049733, lower.tail = FALSE)*2
## [1] 0.9597261

The p-value of 0.9597 is greater than alpha (0.05), and we therefore fail to reject the null hypothesis. There is not sufficient evidence of a difference in the proportion of babies born prematurely to nonsmoking and smoking mothers, so this data gives no indication that a mother’s smoking habit affects the length of a pregnancy.
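As a cross-check on the hand calculation, base R's prop.test() can run the same comparison. This is a sketch rather than part of the original analysis: the counts are re-entered from the table above, the object names are illustrative, and correct = FALSE is set because the hand calculation does not apply a continuity correction.

```r
# Counts taken from the habit-by-premie table above
premature <- c(nonsmoker = 133, smoker = 19)
totals    <- c(nonsmoker = 133 + 739, smoker = 19 + 107)

# Two-sided test of equal proportions without continuity correction
check <- prop.test(x = premature, n = totals, correct = FALSE)

# prop.test() reports a chi-squared statistic, which for two groups
# equals the square of the pooled z statistic
z_from_chisq <- sqrt(unname(check$statistic))
z_from_chisq
check$p.value
```

Taking the square root recovers only the magnitude of z, not its sign, which is enough for a two-sided test.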

Conclusions about these variables and pregnancy length

The data suggests that the variables of father’s age, mother’s age, and mother’s smoking habit are poor predictors of the length of time a pregnancy will last. There was no strong correlation suggested by the plots, there were low correlation coefficients for the variables, and there was not sufficient evidence to suggest a difference between the proportion of premature births for nonsmoking and smoking mothers.

Part 3: Analyzing the impact of variables on birthweight

Correlation Coefficients

#Finding the correlation coefficients for the sets of variables
cor(ncbirths$fage, ncbirths$weight, use = "pairwise.complete.obs")
## [1] 0.07023403
cor(ncbirths$mage, ncbirths$weight, use = "pairwise.complete.obs")
## [1] 0.05506589

The correlation coefficient for father’s age and the baby’s birthweight is approximately 0.0702, while the correlation coefficient for the mother’s age and baby’s birthweight is approximately 0.0551. Both of these values are very close to zero, implying that there is at most a very weak linear correlation between either parent’s age and the birthweight of the baby.

Proportions and Hypothesis Test

#Creating a table to show how many babies did and didn't have low birthweights for smoking and nonsmoking mothers
table(ncbirths$habit, ncbirths$lowbirthweight)
##            
##             low not low
##   nonsmoker  92     781
##   smoker     18     108

#Looking at the proportion of babies with low birthweights for nonsmokers
92/(92+781)
## [1] 0.1053837

#Looking at the proportion of babies with low birthweights for smokers
18/(18+108)
## [1] 0.1428571

Comparing the proportions of low-birthweight babies for nonsmoking and smoking mothers, approximately 10.54% of nonsmoking mothers gave birth to babies with a low birthweight, and approximately 14.29% of smokers gave birth to babies with a low birthweight.

A hypothesis test can be used to test whether the two proportions truly differ, or whether they only appear different due to sampling variation.

Hypothesis:

\(H_0: p_1 = p_2\)

\(H_A: p_1 \neq p_2\)

#Creating the pooled proportion
x.1 <- 92
x.2 <- 18
n.1 <- 92 + 781
n.2 <- 18 + 108
  
ppool.weight <- (x.1 + x.2) / (n.1 + n.2)
q.w <- (1 - ppool.weight)
  
#Finding the standard error
se.weight <- sqrt(((ppool.weight*q.w)/n.1)+((ppool.weight*q.w)/n.2))

#Finding the test statistic
p.1 <- x.1/n.1
p.2 <- x.2/n.2
  
((p.1-p.2)-0)/se.weight
## [1] -1.256178

#Finding the p-value
pnorm(-1.256178)*2
## [1] 0.2090514

The p-value of 0.209 is greater than alpha (0.05), and we therefore fail to reject the null hypothesis. There is not sufficient evidence to say that the proportions of babies with low birthweights born to nonsmoking and smoking mothers truly differ from one another, so this data gives no indication that smoking affects whether a baby has a low birthweight.
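Because the same steps are carried out for both hypothesis tests, they could be wrapped in a small helper function. The sketch below is one possible way to do that; the function name two_prop_z and its return fields are hypothetical. It also adds a check that is not in the original analysis: the common rule of thumb that each group should have at least 10 expected successes and 10 expected failures for the normal approximation to be reasonable.

```r
# Hypothetical helper: pooled two-proportion z-test (two-sided)
two_prop_z <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)

  # Standard error under the null hypothesis p1 = p2
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z <- (p1 - p2) / se

  # Two-sided p-value; -abs(z) works for either sign of z
  p_value <- 2 * pnorm(-abs(z))

  # Rule-of-thumb check (assumed threshold of 10 per cell)
  expected <- c(n1 * p_pool, n1 * (1 - p_pool),
                n2 * p_pool, n2 * (1 - p_pool))

  list(z = z, p_value = p_value, normal_ok = all(expected >= 10))
}

# Low-birthweight counts from the table above
low_bw <- two_prop_z(x1 = 92, n1 = 92 + 781, x2 = 18, n2 = 18 + 108)
low_bw$z
low_bw$p_value
low_bw$normal_ok
```

Using -abs(z) inside pnorm() avoids having to switch between the lower and upper tail depending on the sign of the test statistic.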

Conclusions about these variables and birthweight

The data suggests that the variables of father’s age, mother’s age, and mother’s smoking habit are poor predictors of a baby’s birthweight. There was no strong correlation suggested by the plots, there were low correlation coefficients for the variables, and there was not sufficient evidence to say that there was a difference between the proportions of babies born with a low birthweight for nonsmoking and smoking mothers.

Part 4: Multivariate Regression - Length of Pregnancy

Mother’s Age and Father’s Age

#Creating a linear model for the variables of mother's age and father's age
lm.weeks.ages <- lm(weeks ~ mage+fage, data=ncbirths)
lm.weeks.ages
## 
## Call:
## lm(formula = weeks ~ mage + fage, data = ncbirths)
## 
## Coefficients:
## (Intercept)         mage         fage  
##   38.747494    -0.022717     0.009309

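The regression equation for this model, where \(y\) is the predicted pregnancy length in weeks, \(x_1\) is the mother’s age, and \(x_2\) is the father’s age, is: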

\(y=38.747494 - 0.022717(x_1) + 0.009309(x_2)\)

#Finding the R-squared value with the summary function
summary(lm.weeks.ages)$r.squared
## [1] 0.001178112

The multiple R-squared value of 0.001178 indicates that approximately 0.12% of the variance in pregnancy length is explained by the mother’s age and father’s age. This implies that these variables have little to no linear relationship with the duration of a pregnancy.
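To give a sense of how small these coefficients are in practice, the fitted equation can be evaluated for two sets of ages. This is an illustrative sketch only: the coefficients are copied from the output above, the ages are the first- and third-quartile values from the summaries in Part 1, and the function name predict_weeks is hypothetical.

```r
# Coefficients copied from the lm() output above
b0     <- 38.747494
b_mage <- -0.022717
b_fage <- 0.009309

# Hypothetical helper that evaluates the fitted regression equation
predict_weeks <- function(mage, fage) {
  b0 + b_mage * mage + b_fage * fage
}

younger <- predict_weeks(mage = 22, fage = 25)  # first-quartile ages
older   <- predict_weeks(mage = 32, fage = 35)  # third-quartile ages

# Difference in predicted pregnancy length, in weeks
younger - older
```

With the model object available, predict(lm.weeks.ages, newdata = ...) would give the same values up to the rounding of the printed coefficients.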

Part 5: Multivariate Regression - Birthweight

Mother’s Age and Father’s Age

#Creating a linear model for the variables of mother's age and father's age
lm.weight.ages <- lm(weight ~ mage+fage, data=ncbirths)
lm.weight.ages
## 
## Call:
## lm(formula = weight ~ mage + fage, data = ncbirths)
## 
## Coefficients:
## (Intercept)         mage         fage  
##    6.686128     0.005144     0.011728

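The regression equation for this model, where \(y\) is the predicted birthweight, \(x_1\) is the mother’s age, and \(x_2\) is the father’s age, is: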

\(y=6.686128 + 0.005144(x_1) + 0.011728(x_2)\)

#Finding the R-squared value with the summary function
summary(lm.weight.ages)$r.squared
## [1] 0.005112105

The multiple R-squared value of 0.005112 indicates that approximately 0.51% of the variance in birthweight is explained by the mother’s age and father’s age. This implies that these variables have little to no linear relationship with the birthweight of a baby.
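One way to connect this result to Part 3 is to square the individual correlation coefficients, since the square of a correlation is the R-squared of a one-predictor linear model. The sketch below uses only values printed earlier in this report; the object names are illustrative. The comparison is approximate, because the correlations were computed on pairwise complete observations while lm() drops every row with a missing value in any model variable.

```r
# Values copied from the output in Parts 3 and 5
r_fage_weight <- 0.07023403
r_mage_weight <- 0.05506589
r2_multiple   <- 0.005112105

# R-squared of each one-predictor model is the squared correlation
r2_fage_alone <- r_fage_weight^2
r2_mage_alone <- r_mage_weight^2

# Approximate gain from using both ages instead of father's age alone
gain_over_fage <- r2_multiple - r2_fage_alone

round(c(fage_alone = r2_fage_alone,
        mage_alone = r2_mage_alone,
        both       = r2_multiple,
        gain       = gain_over_fage), 6)
```

The small gain suggests that adding the mother’s age to a model that already contains the father’s age explains almost no additional variance in birthweight.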

Conclusions

Based on the data, mother’s age, father’s age, and the mother’s smoking habit are all poor predictors of a baby’s health as measured by birthweight and premature birth.

Limitations

The most significant limitation of this data is the lack of variables focusing on the health of the baby. The only information present shows whether the baby had a healthy birthweight and whether the pregnancy lasted long enough for the baby not to be born prematurely. There were no variables on the health of the baby outside of these two factors, such as birth defects or whether a health condition was present. If there were other variables on the baby’s condition to analyze, it might have been found that the age of a parent or being carried by a smoking mother has an impact on the overall health of the baby. Because no such variable is present, this analysis cannot detect such effects, even though in reality parents’ age and smoking may affect other aspects of the baby’s health.

Another limitation is that this data does not express any information about miscarriages. There is no way to tell how many other parents were going to have a child but lost their baby, potentially due to one of the variables of age or smoking habit. If parents who lost their child had been included in the sample data, more conclusions may have been drawn about the impacts of these variables.


This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.

The course was led by Professor Billy Jackson.

Student Name: Ryan Duggan

Semester: Spring 2018