HW 03 - Smoking during pregnacy

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Remark: The first think you should do, is try to create the HTML output by clicking on the Knit button.
Then work on each of the exercises (one after the other) and when you are done with one of them, remove the code chunk option eval=FALSE for that code chunk (in case it’s there). Otherwise the code of that chunk will not be evaluated and hence the result would not be included in the HTML output. Replace all the placeholder ___ by appropriate text/code.
Try to create the HTML output whenever you are done with one of the exercises - You can remove this remark when you are done with writing your solution.

Packages

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("infer")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("openintro")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

We’ll use the tidyverse package for much of the data wrangling and visualisation, the infer package for inference, and the data lives in the openintro package.

If you obtain an error after removing the option eval = FALSE in the following code chunk, you have to install at least one of the three packages.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(infer)
library(openintro)

## Loading required package: airports

## Loading required package: cherryblossom

## Loading required package: usdata

Data

The data can be found in the openintro package, and it’s called ncbirths. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?ncbirths in the Console or using the Help menu in RStudio to search for ncbirths.

You can also find this information here.

Set a seed!

In this homework we’ll be generating random samples. The last thing you want is those samples to change every time you knit your document. So, you should set a seed. There’s an R chunk in your R Markdown file set aside for this. Locate it and add a seed. Make sure all members in a team are using the same seed so that you don’t get merge conflicts and your results match up for the narratives.

Exercises

Exercise 1: What are the cases in this data set? How many cases are there in our sample?

Answer:

The cases in this sample are expecting mothers from North Carolina and the births of their children. There are 1000 cases in the sample.

Baby weights

A 1995 study suggests that average weight of Caucasian babies born in the US is 3,369 grams (7.43 pounds). In this dataset we only have information on mother’s race, so we will make the simplifying assumption that babies of Caucasian mothers are also Caucasian, i.e. whitemom = "white".

We want to evaluate whether the average weight of Caucasian babies has changed since 1995.

Exercise 2: Create a filtered data frame called ncbirths_white that contain data only from white mothers. Then, calculate the mean of the weights of their babies.

Answer:

ncbirths_white <- ncbirths %>%
  filter(whitemom == "white")

ncbirths_white %>%
  summarize(mean = mean(weight))

## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1  7.25

Exercise 3: The variable of interest is the weight of the babies. Simulation based inference is necessary, if the distribution of weight is non-normal. Decide if the distribution of weight is nearly normal or not. Create an appropriate plot to explain your reasoning.

Answer: Simulation based Inference is necessary, as the distribution of weight does not follow the linear theorethical quantiles, thus it is non-normal. One can see that for higher values the data follows the center line, however towards the lower values it tends to be quite far away from a normal distribution.

ggplot(ncbirths_white, aes(sample = weight)) + 
  stat_qq()+stat_qq_line()

Let’s discuss how a test of our assumption about the weight of Caucasian babies would work. First we have to specify the null and alternative hypotheses.

Exercise 4: State the null and alternative hypotheses to test: whether the average weight of Caucasian babies has changed since 1995.

Answer:

\(H_0:\) Average weight of caucasian babies has not changed since 1995 (mu = 7.43)

\(H_1:\) Average weight of caucasian babies has changed since 1995 ()

The next step is now to simulate a null distribution of 1000 sample means that is centred at the null value of the average weight. But before we can do that, we have to set a seed.

Exercise 5: Set the seed equal to 123456 (arbitrary choice).

Answer:

set.seed(123456)

Exercise 6: Run the appropriate simulation based hypothesis test at a significance level of 0.05 using functions from the infer package. Simulate the null distribution based on 1000 sample means, visualize the null distribution, calculate the p-value, and interpret the results in context of the data and the hypothesis test.

Answer: We observe a p value of 0. At alpha = 0.05 one can say that due to the smaller p- value, we have evidence against the null hypothesis and therefore reject the hypothesis. ___

null_dist <- ncbirths_white %>%
  specify(response = weight) %>%
  hypothesize(null = "point", mu = 7.43) %>%
  generate(reps=1000,type="bootstrap") %>%
  calculate(stat = "mean")

Now we can visualize the null distribution

visualize(null_dist)

shade_p_value(obs_stat = 7.250462,direction = "less")

## A p-value shading layer.

To compute the p-value, ___

null_dist %>%
  get_pvalue(obs_stat = 7.250462,direction="less")

## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Baby weight vs. smoking

Consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

Exercise 7: Make side-by-side boxplots displaying the relationship between habit and weight (for the whole dataset). What does the plot highlight about the relationship between these two variables?

Answer:

The plot highlights the difference in average weight between non-smoking and smoking mothers. One can clearly see, that babies of non-smoking mothers weigh more on average than babies of smoking mothers. The mean of smoking mothers’ babies is about 6.9 pounds and the mean of non-smokig mothers’ babies is about 7.3 pounds. The average weight of babies whose mothers did not give an answer lies at 3.6 pounds. One can also see that overall, outliers tend to have a lower weight value compared to only two outliers that show a rather high weight.

ggplot(ncbirths, aes(x=weight,y=habit)) +
  geom_boxplot()

Exercise 8: Before moving forward, save a version of the dataset omitting observations where there are NAs for habit. You can call this version ncbirths_habitgiven.

Answer:

ncbirths_habitgiven <- ncbirths %>%
  filter(habit == "smoker" | habit == "nonsmoker")

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following to first group the data by the habit variable, and then calculate the mean weight in these groups using.

ncbirths_habitgiven %>%
  group_by(habit) %>%
  summarise(mean_weight = mean(weight))

## # A tibble: 2 × 2
##   habit     mean_weight
##   <fct>           <dbl>
## 1 nonsmoker        7.14
## 2 smoker           6.83

There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test .

Exercise 9: Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

Answer:

\(H_0\): The average weight of babaies born to smoking and non-smoking mothers equals mean(habit=smoking) = mean (habit=smoking)

\(H_1\): The average weight of babies born to smoking and non-smoking mothers differs mean(habit=smoking) ≠ mean (habit=smoking)

Exercise 10: Run a simulation based hypothesis test at a significance level of 0.05 using functions from the infer package to test the null against the alternative hypothesis from Exercise 9. Simulate the null distribution based on 1000 resamples and then calculate the p-value, and interpret the results in context of the data and the hypothesis test.

Hint: Think about a suitable test statistic. Remember, that instead of assuming the mean values in both groups being the same, we could also assume that their difference is zero.

Answer:

null_dist_habit <- ncbirths_habitgiven %>% 
  specify(formula = weight ~ habit) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("smoker","nonsmoker"))

To calculate the p-value, we need again the value of the test statistic observed for the given dataset.___

obs_diff <- ncbirths_habitgiven %>% 
  specify(formula = weight ~ habit) %>%
  calculate(stat = "diff in means", order = c("smoker","nonsmoker"))
  obs_diff

## Response: weight (numeric)
## Explanatory: habit (factor)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 -0.316

Now we are able to compute the p-value

null_dist_habit %>%
  get_pvalue(obs_diff, direction = "both")

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.028

We observe a p-value of 0.04 which is smaller than our alpha-value of 0.05. Therefore we can reject the null hypothesis, meaning that the average weight of babies born to smokers does indeed differ from the average weight of babies born to non-smokers. ___

Baby weight vs. mother’s age

In this portion of the analysis we focus on two variables (from the complete dataset). The first one is mature, which has two levels "mature mom" and "younger mom".

Exercise 11: First, a non-inference task: Determine the age cutoff for younger and mature mothers. Use the quantile() function to determine the age, such that the correct amount of younger mothers has an age being less or equal than that value.

Answer:

Let’s first find out how many of 1000 mothers are younger moms. ___

ncbirths %>%
  summarise(sum = sum(mature == "younger mom"))

## # A tibble: 1 × 1
##     sum
##   <int>
## 1   867

This means, we are interested in the 0.867 quantile of the variable mage, which contains the actual age of the mothers.

quantile(ncbirths$mage, probs=0.867)

##  86.7% 
## 34.133

The other variable of interest is lowbirthweight.

Exercise 12: Conduct a hypothesis test evaluating whether the proportion of low birth weight babies is higher for mature mothers. State the hypotheses, run the simulation based test for 1000 resamples and calculate the p-value, and state your conclusion in context of the research question. Use \(\alpha = 0.05\). If you find a significant difference, construct a confidence interval, at the equivalent level to the hypothesis test, for the difference between the proportions of low birth weight babies between mature and younger mothers, and interpret this interval in context of the data.

Answer:

\(H_0:\) The proportion of low birth weight babies from younger mothers and mature mothers does not differ

\(H_1:\) The proportion of low birth weight babies from is higher for mature mothers than for younger mothers.

null_dist_bweight <- ncbirths %>%
  specify(success = "low", formula = lowbirthweight ~ mature)%>%
  hypothesize(null = "independence")%>%
  generate(reps = 1000, type = "permute")%>%
  calculate(stat = "diff in props", order = c("mature mom", "younger mom"))

The observed test statistic value is equal to

obs_diff_prop <- ncbirths %>%
  specify(success = "low",formula = lowbirthweight ~ mature)%>%
  calculate(stat = "diff in props", order = c("mature mom", "younger mom"))
  obs_diff_prop

## Response: lowbirthweight (factor)
## Explanatory: mature (factor)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 0.0281

and the resulting p-value is equal to

null_dist_bweight %>%
  get_pvalue(obs_diff_prop, direction = "greater")

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1    0.23

We observe a p-value of 0.18 which higher than our alpha-value of 0.05. Therefore we can accept the null hypothesis as true, which means that the proportion of low weight babies born to younger mothers does not differ from the proportion of low weight babies born to mature mothers.