This project investigates how the mother’s age, the father’s age, and the mother’s smoking habit affect the health of a baby, specifically the baby’s birthweight and whether it was born prematurely.
I will be using a dataset called “ncbirths” from the openintro package. It is a random sample of 1,000 births drawn from the birth records released by the state of North Carolina in 2004. This dataset has been of interest to medical researchers who study the relationship between the habits and practices of expectant mothers and the birth of their children. The goal is to determine whether the variables recorded in ncbirths are related to each other and, if so, how strong those relationships are.
#Storing the dataset into the environment
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
##
## cars, trees
ncbirths <- ncbirths
# View the structure of dataset
str(ncbirths)
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
The dataset contains 1,000 randomly sampled births and 13 variables. This analysis focuses on five (5) of them: the mother’s age and father’s age (in years), the mother’s smoking habit, the length of pregnancy (in weeks), and the birthweight of the baby (in pounds).
# Preview of first few lines of data
head(ncbirths)
## fage mage mature weeks premie visits marital gained weight
## 1 NA 13 younger mom 39 full term 10 married 38 7.63
## 2 NA 14 younger mom 42 full term 15 married 20 7.88
## 3 19 15 younger mom 37 full term 11 married 38 6.63
## 4 21 15 younger mom 41 full term 6 married 34 8.00
## 5 NA 15 younger mom 39 full term 9 married 27 6.38
## 6 NA 15 younger mom 38 full term 19 married 22 5.38
## lowbirthweight gender habit whitemom
## 1 not low male nonsmoker not white
## 2 not low male nonsmoker not white
## 3 not low female nonsmoker white
## 4 not low male nonsmoker white
## 5 not low female nonsmoker not white
## 6 low male nonsmoker not white
summary(ncbirths)
## fage mage mature weeks
## Min. :14.00 Min. :13 mature mom :133 Min. :20.00
## 1st Qu.:25.00 1st Qu.:22 younger mom:867 1st Qu.:37.00
## Median :30.00 Median :27 Median :39.00
## Mean :30.26 Mean :27 Mean :38.33
## 3rd Qu.:35.00 3rd Qu.:32 3rd Qu.:40.00
## Max. :55.00 Max. :50 Max. :45.00
## NA's :171 NA's :2
## premie visits marital gained
## full term:846 Min. : 0.0 married :386 Min. : 0.00
## premie :152 1st Qu.:10.0 not married:613 1st Qu.:20.00
## NA's : 2 Median :12.0 NA's : 1 Median :30.00
## Mean :12.1 Mean :30.33
## 3rd Qu.:15.0 3rd Qu.:38.00
## Max. :30.0 Max. :85.00
## NA's :9 NA's :27
## weight lowbirthweight gender habit
## Min. : 1.000 low :111 female:503 nonsmoker:873
## 1st Qu.: 6.380 not low:889 male :497 smoker :126
## Median : 7.310 NA's : 1
## Mean : 7.101
## 3rd Qu.: 8.060
## Max. :11.750
##
## whitemom
## not white:284
## white :714
## NA's : 2
##
##
##
##
According to the summary, the father’s age ranges from 14 to 55 years, the mother’s age ranges from 13 to 50 years, the pregnancy length ranges from 20 to 45 weeks, and the weight of the baby ranges from 1 to 11.75 pounds.
#Seeing how many mothers were smokers and nonsmokers
table(ncbirths$habit)
##
## nonsmoker smoker
## 873 126
# Observing the summary and sample size of the mother's age variable
summary(ncbirths$mage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13 22 27 27 32 50
table(is.na(ncbirths$mage))
##
## FALSE
## 1000
# Observing the summary and sample size of the father's age variable
summary(ncbirths$fage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.00 25.00 30.00 30.26 35.00 55.00 171
table(is.na(ncbirths$fage))
##
## FALSE TRUE
## 829 171
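The missing-value checks above can also be run for several columns at once. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in, not the real ncbirths values) so that it runs without any packages; the same two calls should work on ncbirths itself.

```r
# Made-up stand-in for a few ncbirths columns (illustrative values only)
toy <- data.frame(
  fage  = c(NA, 19, 21, NA, 30),
  mage  = c(13, 15, 15, 16, 28),
  weeks = c(39, 37, 41, NA, 40)
)

# Count missing values in every column at once
na_counts <- colSums(is.na(toy))
print(na_counts)

# Count rows where all three variables are present
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Knowing how many rows are complete matters later, because functions such as `lm()` drop rows with missing values by default.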
Plotting pregnancy length in correlation to father’s age, mother’s age, and mother’s smoking habits
To see if there is a relationship between each pair of variables, we should look at a plot of the data.
# Creating a scatterplot for the variable mother's age as explanatory variable(X) and length of pregnancy as response variable(Y)
plot(weeks ~ mage, data = ncbirths, main="Mother's Age and Pregnancy Length", xlab="Mother's Age", ylab = "Pregnancy Length in Weeks")
# Creating a scatterplot for the variable father's age as explanatory variable(X) and length of pregnancy as response variable(Y)
plot(weeks ~ fage, data = ncbirths, main="Father's Age and Pregnancy Length", xlab="Father's Age", ylab = "Pregnancy Length in Weeks")
#Creating a box plot for the variables mother's smoking habit and pregnancy length
plot(weeks ~ habit, data = ncbirths, main="Mother's Smoking Habit and Pregnancy Length", xlab="Smoking Habit of Mother", ylab = "Pregnancy Length in Weeks")
Correlation Coefficients for the set of variables
#Finding the correlation coefficients for the sets of variables
cor(ncbirths$mage, ncbirths$weeks, use = "pairwise.complete.obs")
## [1] -0.03208743
cor(ncbirths$fage, ncbirths$weeks, use = "pairwise.complete.obs")
## [1] -0.01607193
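The `use = "pairwise.complete.obs"` argument in the calls above is needed because some ages and pregnancy lengths are missing. The following sketch, using made-up vectors rather than the ncbirths columns, shows what the argument does: pairs containing an `NA` are dropped before the correlation is computed.

```r
# Made-up vectors with one missing value (not taken from ncbirths)
x <- c(20, 25, NA, 35, 40)
y <- c(39, 38, 40, 37, 41)

# Default behaviour: any NA makes the result NA
r_default <- cor(x, y)

# Pairwise deletion: pairs with an NA are dropped before computing r
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# Equivalent manual version: keep only the complete pairs
keep <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])

print(c(default = r_default, pairwise = r_pairwise, manual = r_manual))
```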
Looking at the scatterplots for father’s age and pregnancy length as well as for mother’s age and pregnancy length, neither pair of variables shows a visibly strong correlation, and the correlation coefficients (about -0.03 for mother’s age and -0.02 for father’s age) confirm this. In the box plot of pregnancy length for smoking and nonsmoking mothers, the median pregnancy length is about the same for both groups. Although the nonsmoking mothers have a wider range of pregnancy lengths, this could be due to the larger sample size of nonsmokers.
Plotting the birthweight of a baby in correlation to father’s age, mother’s age, and mother’s smoking habits
# Creating a scatterplot for mother's age and babyweight
plot(weight ~ mage, data = ncbirths, main = "Mother's Age and Baby's Birthweight", xlab="Mother's Age", ylab="Baby's Birthweight")
Looking at the scatterplot for mother’s age and baby’s birthweight, there is a relatively weak linear relationship between the mother’s age and the birthweight.
#Creating a scatterplot for father's age and babyweight
plot(weight ~ fage , data = ncbirths, main = "Father's Age and Baby's Birthweight", xlab="Father's Age", ylab="Baby's Birthweight")
Looking at the scatterplot for father’s age and baby’s birthweight, there is a relatively weak linear relationship between the father’s age and the birthweight.
# Creating a side-by-side boxplot of habit and weight
boxplot(weight ~ habit,data=ncbirths, main="Relation Between Mother's Smoking Habit and Baby's Weight",
ylab="Baby's Weight", xlab="Mother Smoker/Non-Smoker")
Interestingly, the nonsmoker group shows a much wider spread of birthweights, in addition to a higher median weight.
The box plot of mother’s smoking habit and baby’s birthweight shows that the median birthweight for babies of smokers is only slightly below that of nonsmoking mothers’ children. Although the nonsmoking mothers have a wider range of birthweights for their children, with more outliers, this could be due to the larger sample size of nonsmokers.
Correlation Coefficients for the set of variables
#Finding the correlation coefficients for the sets of variables
cor(ncbirths$mage, ncbirths$weight, use = "pairwise.complete.obs")
## [1] 0.05506589
cor(ncbirths$fage, ncbirths$weight, use = "pairwise.complete.obs")
## [1] 0.07023403
# Creating a linear model of pregnancy length on mother's age and father's age
lm_weeks.ages <- lm(weeks ~ mage + fage, data=ncbirths)
lm_weeks.ages
##
## Call:
## lm(formula = weeks ~ mage + fage, data = ncbirths)
##
## Coefficients:
## (Intercept) mage fage
## 38.747494 -0.022717 0.009309
The regression equation for this model is: \(y = 38.747494 - 0.022717(x1) + 0.009309(x2)\)
# Finding the R-squared value with the summary function
summary(lm_weeks.ages)$r.squared
## [1] 0.001178112
The multiple R-squared value of 0.001178 indicates that approximately 0.12% of the variance in pregnancy length is explained by the mother’s age and father’s age together. This implies that these variables have little to no linear relationship with the duration of a pregnancy.
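To make the meaning of R-squared concrete, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, then compares it with the value reported by `summary()`. The data are simulated (assumed values, unrelated to ncbirths) so that the block runs on its own; the response is generated independently of the predictors, so a small R-squared is expected.

```r
set.seed(1)  # made-up data, unrelated to ncbirths
n  <- 50
x1 <- rnorm(n, mean = 27, sd = 6)
x2 <- rnorm(n, mean = 30, sd = 7)
y  <- 38 + rnorm(n, sd = 2)   # response generated independently of x1 and x2

fit <- lm(y ~ x1 + x2)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

r2_summary <- summary(fit)$r.squared
print(c(manual = r2_manual, summary = r2_summary))
```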
Multiple Regression - Birthweight, Mother’s Age and Father’s Age
# Creating a linear model for the variables of mother's age and father's age
lm_weight.ages <- lm(weight ~ mage + fage, data=ncbirths)
lm_weight.ages
##
## Call:
## lm(formula = weight ~ mage + fage, data = ncbirths)
##
## Coefficients:
## (Intercept) mage fage
## 6.686128 0.005144 0.011728
The regression equation for this model is: \(y = 6.686128 + 0.005144(x1) + 0.011728(x2)\)
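As a worked example of using this equation, the sketch below plugs in a hypothetical pair of parents at the sample median ages (mother 27, father 30). The coefficients are copied from the model output above; the variable names are chosen here for illustration.

```r
# Coefficients copied from the fitted model output above
b0     <- 6.686128
b_mage <- 0.005144
b_fage <- 0.011728

# Hypothetical parents at the sample median ages
mage <- 27
fage <- 30
predicted_weight <- b0 + b_mage * mage + b_fage * fage
print(predicted_weight)  # about 7.18 pounds

# Change in predicted weight when both parents are 10 years older
ten_year_change <- 10 * (b_mage + b_fage)
print(ten_year_change)   # about 0.17 pounds
```

A ten-year increase in both ages shifts the prediction by less than a fifth of a pound, which illustrates how small the estimated age effects are.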
# Finding the R-squared value with the summary function
summary(lm_weight.ages)$r.squared
## [1] 0.005112105
The multiple R-squared value of 0.005112 indicates that approximately 0.51% of the variance in birthweight is explained by the mother’s age and father’s age together. This implies that these variables have little to no linear relationship with the birthweight of a baby.
# Smokers subset
ncsmokers <- subset(ncbirths, ncbirths$habit == "smoker")
# Nonsmokers subset
ncnonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
# Mean difference in weights (x1 - x2)
mean(ncsmokers$weight) - mean(ncnonsmokers$weight)
## [1] -0.3155425
summary(ncsmokers$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.690 6.077 7.060 6.829 7.735 9.190
summary(ncnonsmokers$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.440 7.310 7.144 8.060 11.750
The box plot and the summaries above highlight the association between the two variables. On average, babies born to nonsmokers weigh more than those born to smokers (the mean is about 0.32 pounds lower for smokers, and the median is lower as well), so smoking habit appears to be associated with birthweight.
# Sample sizes to check the conditions
by(ncbirths$weight, ncbirths$habit, length)
## ncbirths$habit: nonsmoker
## [1] 873
## --------------------------------------------------------
## ncbirths$habit: smoker
## [1] 126
The hypotheses for testing whether the average weights of babies born to smoking and nonsmoking mothers differ are:
\(H_0: \mu_{smoking} = \mu_{nonsmoking}\). There is no significant difference between the mean birthweights of children born to smoking and nonsmoking mothers.
\(H_A: \mu_{smoking} \neq \mu_{nonsmoking}\). There is a significant difference between the means.
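For reference, the test statistic computed in the code that follows can be written out as below. This is the standard two-sample statistic with unpooled standard error; the degrees of freedom shown are the conservative choice used in this report, the smaller sample size minus one.

```latex
T = \frac{\bar{x}_{smoking} - \bar{x}_{nonsmoking}}
         {\sqrt{\dfrac{s_{smoking}^{2}}{n_{smoking}} + \dfrac{s_{nonsmoking}^{2}}{n_{nonsmoking}}}},
\qquad
df = \min(n_{smoking},\, n_{nonsmoking}) - 1 = 126 - 1 = 125
```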
# store standard error
n1 <- 126
n2 <- 873
se <- sqrt((sd(ncsmokers$weight)^2/n1) + (sd(ncnonsmokers$weight)^2/n2))
# test statistic
ts <- (mean(ncsmokers$weight) - mean(ncnonsmokers$weight))/se
# two-sided p-value (HA is "not equal"), with conservative df = min(n1, n2) - 1
2 * pt(-abs(ts), df = 125)
## [1] 0.01987363
Decide whether to reject H0 or fail to reject H0 at a significance level of 0.05, based on the p-value.
The p-value is less than alpha = 0.05, therefore we reject H0.
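A hand calculation like this can be cross-checked against R's built-in `t.test()`, which runs a Welch two-sample test by default. The sketch below does so on simulated weights (made-up values, not the ncbirths data) so that it runs on its own; `smoker_w` and `nonsmoker_w` are hypothetical names.

```r
set.seed(42)  # made-up weights, not the ncbirths values
smoker_w    <- rnorm(30,  mean = 6.8, sd = 1.4)
nonsmoker_w <- rnorm(200, mean = 7.1, sd = 1.5)

# Manual statistic with the unpooled standard error
se <- sqrt(var(smoker_w) / length(smoker_w) +
           var(nonsmoker_w) / length(nonsmoker_w))
ts_manual <- (mean(smoker_w) - mean(nonsmoker_w)) / se

# Conservative two-sided p-value with df = min(n1, n2) - 1
df_manual <- min(length(smoker_w), length(nonsmoker_w)) - 1
p_manual  <- 2 * pt(-abs(ts_manual), df = df_manual)

# Built-in Welch test (t.test defaults to var.equal = FALSE)
welch <- t.test(smoker_w, nonsmoker_w)

print(c(manual_t = ts_manual, welch_t = unname(welch$statistic)))
print(c(manual_p = p_manual, welch_p = welch$p.value))
```

The two test statistics agree exactly. The p-values differ slightly because `t.test()` uses the Welch-Satterthwaite degrees of freedom, which are never smaller than min(n1, n2) - 1, so the manual p-value is the more conservative of the two.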
Based on the data, the mother’s age, the father’s age, and the mother’s smoking habit are all poor predictors of a baby’s health as measured by birthweight and by whether the baby is likely to be born prematurely. However, the p-value from the hypothesis test shows that babies of smoking mothers do not have the same average birthweight as babies of nonsmoking mothers.
The most significant limitation of this data is the lack of variables focusing on the health of the baby. The only information present shows whether the baby had a healthy birthweight and whether the pregnancy lasted long enough for the baby not to be born prematurely. There were no variables on the health of the baby outside of these two factors, such as birth defects or whether a health condition was present. If there had been other variables on the baby’s condition to analyze, it might have been found that a parent’s age or the mother’s smoking habit has an impact on the overall health of the baby. However, because no such variable is present, it appears that parents’ age and smoking have no impact, even though in reality they may affect other aspects of the baby’s health.
Another limitation is that this data does not express any information about miscarriages. There is no way to tell how many other parents were going to have a child but lost their baby, potentially due to one of the variables of age or smoking habit. If parents who lost their child had been included in the sample data, more conclusions may have been drawn about the impacts of these variables.
This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.