In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. The objective of this project is to study the relation between habits and practices of expectant mothers and the birth of their children.

The data can be found in the openintro package, and it’s called ncbirths.

Start with loading the packages: tidyverse, tidymodels, openintro

#viewing the dataset
summary(ncbirths)
##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   not married:386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   married    :613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
## 

There are 1000 cases in this sample. A case represents the number of participants or mothers in this experiment.

A 1995 study suggests that average weight of Caucasian babies born in the US is 3,369 grams (7.43 pounds). In this dataset, we only have information on the mother’s race, so we will make the simplifying assumption that babies of Caucasian mothers are also Caucasian, i.e. whitemom = “white”.

We want to evaluate whether the average weight of Caucasian babies has changed since 1995.

Our null hypothesis should state “there is nothing going on”, i.e. no change since 1995:

H_0: μ=7.43 pounds.

Our alternative hypothesis should reflect the research question, i.e. some change since 1995. Since the research question doesn’t state a direction for the change, we use a two-sided alternative hypothesis:

H_A: μ≠7.43 pounds.

Create a filtered data frame called ncbirths_white that contains data only from white mothers. Then, calculate the mean of the weights of their babies.

ncbirths_white <-(
  ncbirths%>%
    filter(whitemom == "white")
)

ncbirths_white%>%
  summarize(mean = mean(weight))
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1  7.25

The mean weight calculated is 7.25.

#Simulation-based Inference:

The conditions necessary to make a simulation based inference are met. The sample size is large enough to make random sample sizes. Secondly, the calculation is accurate enough that the observed data could have been derived from an average of runs of the experiment.Therefore, the predicted values of the observed data can be used to make conclusions of a larger sample size or the total births in a specific area.

Our goal is to simulate a null distribution of sample means that is centered at the null value of 7.43 pounds.

In order to do so, we

• take a bootstrap sample of from the original sample,

• calculate this bootstrap sample’s mean,

• repeat these two steps a large number of times to create a bootstrap distribution of means centered at the observed sample mean,

• shift this distribution to be centered at the null value by subtracting / adding X to all bootstrap mean (X = difference between mean of bootstrap distribution and null value), and

• calculate the p-value as the proportion of bootstrap samples that yielded a sample mean at least as extreme as the observed sample mean.

From the visualization above, a higher sample size of 1000 reduces variability in the data and the histogram more accurately reflects the distribution of the population.

The hypothesis test:

#Run the appropriate hypothesis test, visualize the null distribution, calculate the p-value, and interpret the results in context of the data and the hypothesis test.

# Specify the null hypothesis mu = 7.43 pounds
null_dist <- ncbirths_white %>%
  specify(response = weight) %>%
  hypothesize(null = "point", mu = 7.43) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Plot the histogram of the null distribution
ggplot(data = null_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 0.01) +
  labs(title = "Null Distribution of Means",
       x = "Mean Weight (pounds)", y = "Frequency") +
  theme_minimal()

# Calculate the p-value
p_value <- null_dist %>%
  filter(stat <= (mean(ncbirths_white$weight))) %>%
  summarise(p_value = n() / nrow(null_dist)) %>%
  pull(p_value)

# Print the p-value
print(p_value)
## [1] 0

Since the p-value is below 0.05, we can reject the null hypothesis and conclude that the average weight of Caucasian babies has changed since 1955.

Next, consider the possible relationship between a mother’s smoking habit and the weight of her baby. Make side-by-side boxplots displaying the relationship between habit and weight.

# Create boxplot
boxplot(ncbirths$weight ~ ncbirths$habit, main= "Weight of the Babies by Smoking Habits of the Mothers",
        xlab = "Weight of the Babies", ylab = "Smoking Habits of the Mothers",
        horizontal = TRUE)

From the box plot, we see that the average baby weight for mothers who are non smokers is slightly higher than the smokers. There is also significanly more data for the non smokers.

Before moving forward, save a version of the dataset omitting observations where there are NAs for habit and call this version ncbirths_habitgiven.

ncbirths_habitgiven <-(
  ncbirths%>%
    filter(habit != "NA"))

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following to first group the data by the habit variable, and then calculate the mean weight in these groups.

#calculate the mean weight 
ncbirths_habitgiven %>%
  group_by(habit) %>% summarise(mean_weight = mean(weight))
## # A tibble: 2 × 2
##   habit     mean_weight
##   <fct>           <dbl>
## 1 nonsmoker        7.14
## 2 smoker           6.83

We see the the average weight from non smoking mothers (7.14) is higher than that of smoking mothers (6.83)

There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.

Null Hypothesis: There is no difference in means of the weights of babies born between the smoking and non-smoking mothers.

Alternative Hypothesis: There is a difference in the average weights of babies born between the smoking and non-smoking mothers.

Both variables have a sample size larger than 30, therefore meeting the conditions necessary to conduct a simulation based inference:

# Count the number of observations for each smoking habit
ncbirths_habitgiven%>%
  count(habit)
## # A tibble: 2 × 2
##   habit         n
##   <fct>     <int>
## 1 nonsmoker   873
## 2 smoker      126

Construct a 95% confidence interval for the difference between the average weights of babies born to smoking and non-smoking mothers.

inference(y = ncbirths_habitgiven$weight, x = ncbirths_habitgiven$habit, data= ncbirths, statistic = "mean", type = "ht", null = 0,
          alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_nonsmoker = 873, y_bar_nonsmoker = 7.1443, s_nonsmoker = 1.5187
## n_smoker = 126, y_bar_smoker = 6.8287, s_smoker = 1.3862
## H0: mu_nonsmoker =  mu_smoker
## HA: mu_nonsmoker != mu_smoker
## t = 2.359, df = 125
## p_value = 0.0199

Since the p_value is below 0.05, we can reject the null hypothesis that there is no difference between the means of weights of babies born from both mother groups. We can conclude that the average baby weight from a non smoker is higher than the average baby weight from a smoking mother

In this portion of the analysis we focus on two variables. The first one is maturemom. First, a non-inference task to determine the age cutoff for younger and mature mothers.

ncbirths %>%
  
  group_by(mature) %>% summarise(max_weight = max(mage))
## # A tibble: 2 × 2
##   mature      max_weight
##   <fct>            <int>
## 1 mature mom          50
## 2 younger mom         34

The cut off age for young mothers is 34. This method to determine the age cutoff for younger and mature mothers works because it gives the maximum age of a young mother.

A hypothesis test to evaluate whether the proportion of low birth weight babies is higher for mature mothers.

ncbirths_low <-(
  ncbirths%>%
    filter(lowbirthweight == "low")
)

The null hypothesis: there is no difference in the proportion of low birth weights between mature and younger mothers.

The alternative hypothesis: the proportion of low birth weights is higher for mature mothers than younger mothers.

inference(ncbirths_low$weight, ncbirths_low$mature,data= ncbirths, statistic = "mean", type = "ht", null = 0,
          alternative = "greater", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_mature mom = 18, y_bar_mature mom = 3.9989, s_mature mom = 1.3713
## n_younger mom = 93, y_bar_younger mom = 4.0417, s_younger mom = 1.3807
## H0: mu_mature mom =  mu_younger mom
## HA: mu_mature mom > mu_younger mom
## t = -0.1212, df = 17
## p_value = 0.5475

The p value of 0.547 shows that there is no significant difference between the weights of the mothers thus we cannot reject the null hypothesis.