In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. The objective of this project is to study the relation between habits and practices of expectant mothers and the birth of their children.
The data can be found in the openintro package, and it’s called ncbirths.
Start with loading the packages: tidyverse, tidymodels, openintro
#viewing the dataset
summary(ncbirths)
## fage mage mature weeks premie
## Min. :14.00 Min. :13 mature mom :133 Min. :20.00 full term:846
## 1st Qu.:25.00 1st Qu.:22 younger mom:867 1st Qu.:37.00 premie :152
## Median :30.00 Median :27 Median :39.00 NA's : 2
## Mean :30.26 Mean :27 Mean :38.33
## 3rd Qu.:35.00 3rd Qu.:32 3rd Qu.:40.00
## Max. :55.00 Max. :50 Max. :45.00
## NA's :171 NA's :2
## visits marital gained weight
## Min. : 0.0 not married:386 Min. : 0.00 Min. : 1.000
## 1st Qu.:10.0 married :613 1st Qu.:20.00 1st Qu.: 6.380
## Median :12.0 NA's : 1 Median :30.00 Median : 7.310
## Mean :12.1 Mean :30.33 Mean : 7.101
## 3rd Qu.:15.0 3rd Qu.:38.00 3rd Qu.: 8.060
## Max. :30.0 Max. :85.00 Max. :11.750
## NA's :9 NA's :27
## lowbirthweight gender habit whitemom
## low :111 female:503 nonsmoker:873 not white:284
## not low:889 male :497 smoker :126 white :714
## NA's : 1 NA's : 2
##
##
##
##
There are 1000 cases in this sample. A case represents the number of participants or mothers in this experiment.
A 1995 study suggests that average weight of Caucasian babies born in the US is 3,369 grams (7.43 pounds). In this dataset, we only have information on the mother’s race, so we will make the simplifying assumption that babies of Caucasian mothers are also Caucasian, i.e. whitemom = “white”.
We want to evaluate whether the average weight of Caucasian babies has changed since 1995.
Our null hypothesis should state “there is nothing going on”, i.e. no change since 1995:
H_0: μ=7.43 pounds.
Our alternative hypothesis should reflect the research question, i.e. some change since 1995. Since the research question doesn’t state a direction for the change, we use a two-sided alternative hypothesis:
H_A: μ≠7.43 pounds.
Create a filtered data frame called ncbirths_white that
contains data only from white mothers. Then, calculate the mean of the
weights of their babies.
ncbirths_white <-(
ncbirths%>%
filter(whitemom == "white")
)
ncbirths_white%>%
summarize(mean = mean(weight))
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 7.25
The mean weight calculated is 7.25.
#Simulation-based Inference:
The conditions necessary to make a simulation based inference are met. The sample size is large enough to make random sample sizes. Secondly, the calculation is accurate enough that the observed data could have been derived from an average of runs of the experiment.Therefore, the predicted values of the observed data can be used to make conclusions of a larger sample size or the total births in a specific area.
Our goal is to simulate a null distribution of sample means that is centered at the null value of 7.43 pounds.
In order to do so, we
• take a bootstrap sample of from the original sample,
• calculate this bootstrap sample’s mean,
• repeat these two steps a large number of times to create a bootstrap distribution of means centered at the observed sample mean,
• shift this distribution to be centered at the null value by subtracting / adding X to all bootstrap mean (X = difference between mean of bootstrap distribution and null value), and
• calculate the p-value as the proportion of bootstrap samples that yielded a sample mean at least as extreme as the observed sample mean.
From the visualization above, a higher sample size of 1000 reduces
variability in the data and the histogram more accurately reflects the
distribution of the population.
The hypothesis test:
#Run the appropriate hypothesis test, visualize the null distribution, calculate the p-value, and interpret the results in context of the data and the hypothesis test.
# Specify the null hypothesis mu = 7.43 pounds
null_dist <- ncbirths_white %>%
specify(response = weight) %>%
hypothesize(null = "point", mu = 7.43) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
# Plot the histogram of the null distribution
ggplot(data = null_dist, mapping = aes(x = stat)) +
geom_histogram(binwidth = 0.01) +
labs(title = "Null Distribution of Means",
x = "Mean Weight (pounds)", y = "Frequency") +
theme_minimal()
# Calculate the p-value
p_value <- null_dist %>%
filter(stat <= (mean(ncbirths_white$weight))) %>%
summarise(p_value = n() / nrow(null_dist)) %>%
pull(p_value)
# Print the p-value
print(p_value)
## [1] 0
Since the p-value is below 0.05, we can reject the null hypothesis and conclude that the average weight of Caucasian babies has changed since 1955.
Next, consider the possible relationship between a mother’s smoking habit and the weight of her baby. Make side-by-side boxplots displaying the relationship between habit and weight.
# Create boxplot
boxplot(ncbirths$weight ~ ncbirths$habit, main= "Weight of the Babies by Smoking Habits of the Mothers",
xlab = "Weight of the Babies", ylab = "Smoking Habits of the Mothers",
horizontal = TRUE)
From the box plot, we see that the average baby weight for mothers who
are non smokers is slightly higher than the smokers. There is also
significanly more data for the non smokers.
Before moving forward, save a version of the dataset omitting
observations where there are NAs for habit and call this version
ncbirths_habitgiven.
ncbirths_habitgiven <-(
ncbirths%>%
filter(habit != "NA"))
The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following to first group the data by the habit variable, and then calculate the mean weight in these groups.
#calculate the mean weight
ncbirths_habitgiven %>%
group_by(habit) %>% summarise(mean_weight = mean(weight))
## # A tibble: 2 × 2
## habit mean_weight
## <fct> <dbl>
## 1 nonsmoker 7.14
## 2 smoker 6.83
We see the the average weight from non smoking mothers (7.14) is higher than that of smoking mothers (6.83)
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
Null Hypothesis: There is no difference in means of the weights of babies born between the smoking and non-smoking mothers.
Alternative Hypothesis: There is a difference in the average weights of babies born between the smoking and non-smoking mothers.
Both variables have a sample size larger than 30, therefore meeting the conditions necessary to conduct a simulation based inference:
# Count the number of observations for each smoking habit
ncbirths_habitgiven%>%
count(habit)
## # A tibble: 2 × 2
## habit n
## <fct> <int>
## 1 nonsmoker 873
## 2 smoker 126
Construct a 95% confidence interval for the difference between the average weights of babies born to smoking and non-smoking mothers.
inference(y = ncbirths_habitgiven$weight, x = ncbirths_habitgiven$habit, data= ncbirths, statistic = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## H0: mu_nonsmoker = mu_smoker
## HA: mu_nonsmoker != mu_smoker
## t = 2.359, df = 125
## p_value = 0.0199
Since the p_value is below 0.05, we can reject the null hypothesis that
there is no difference between the means of weights of babies born from
both mother groups. We can conclude that the average baby weight from a
non smoker is higher than the average baby weight from a smoking
mother
In this portion of the analysis we focus on two variables. The first one is maturemom. First, a non-inference task to determine the age cutoff for younger and mature mothers.
ncbirths %>%
group_by(mature) %>% summarise(max_weight = max(mage))
## # A tibble: 2 × 2
## mature max_weight
## <fct> <int>
## 1 mature mom 50
## 2 younger mom 34
The cut off age for young mothers is 34. This method to determine the age cutoff for younger and mature mothers works because it gives the maximum age of a young mother.
A hypothesis test to evaluate whether the proportion of low birth weight babies is higher for mature mothers.
ncbirths_low <-(
ncbirths%>%
filter(lowbirthweight == "low")
)
The null hypothesis: there is no difference in the proportion of low birth weights between mature and younger mothers.
The alternative hypothesis: the proportion of low birth weights is higher for mature mothers than younger mothers.
inference(ncbirths_low$weight, ncbirths_low$mature,data= ncbirths, statistic = "mean", type = "ht", null = 0,
alternative = "greater", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_mature mom = 18, y_bar_mature mom = 3.9989, s_mature mom = 1.3713
## n_younger mom = 93, y_bar_younger mom = 4.0417, s_younger mom = 1.3807
## H0: mu_mature mom = mu_younger mom
## HA: mu_mature mom > mu_younger mom
## t = -0.1212, df = 17
## p_value = 0.5475
The p value of 0.547 shows that there is no significant difference between the weights of the mothers thus we cannot reject the null hypothesis.