In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
Remark: The first think you should do, is try to
create the HTML output by clicking on the Knit button.
Then work on each of the exercises (one after the other) and when you
are done with one of them, remove the code chunk option
eval=FALSE for that code chunk (in case it’s there).
Otherwise the code of that chunk will not be evaluated and hence the
result would not be included in the HTML output. Replace all the
placeholder ___ by appropriate text/code.
Try to create the HTML output whenever you are done with one of the
exercises - You can remove this remark when you are done with
writing your solution.
install.packages("tidyverse")## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("infer")## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("openintro")## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
We’ll use the tidyverse package for much of the data wrangling and visualisation, the infer package for inference, and the data lives in the openintro package.
If you obtain an error after removing the option
eval = FALSE in the following code chunk, you have to
install at least one of the three packages.
library(tidyverse) ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(infer)
library(openintro)## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
The data can be found in the openintro package, and
it’s called ncbirths. Since the dataset is distributed with
the package, we don’t need to load it separately; it becomes available
to us when we load the package. You can find out more about the dataset
by inspecting its documentation, which you can access by running
?ncbirths in the Console or using the Help menu in RStudio
to search for ncbirths.
You can also find this information here.
In this homework we’ll be generating random samples. The last thing you want is those samples to change every time you knit your document. So, you should set a seed. There’s an R chunk in your R Markdown file set aside for this. Locate it and add a seed. Make sure all members in a team are using the same seed so that you don’t get merge conflicts and your results match up for the narratives.
Exercise 1: What are the cases in this data set? How many cases are there in our sample?
Answer:
The cases in this sample are expecting mothers from North Carolina and the births of their children. There are 1000 cases in the sample.
A 1995
study suggests that average weight of Caucasian babies born in the
US is 3,369 grams (7.43 pounds). In this dataset we only have
information on mother’s race, so we will make the simplifying assumption
that babies of Caucasian mothers are also Caucasian,
i.e. whitemom = "white".
We want to evaluate whether the average weight of Caucasian babies has changed since 1995.
Exercise 2: Create a filtered data frame called
ncbirths_white that contain data only from white mothers.
Then, calculate the mean of the weights of their babies.
Answer:
ncbirths_white <- ncbirths %>%
filter(whitemom == "white")
ncbirths_white %>%
summarize(mean = mean(weight))## # A tibble: 1 × 1
## mean
## <dbl>
## 1 7.25
Exercise 3: The variable of interest is the
weight of the babies. Simulation based inference is
necessary, if the distribution of weight is non-normal.
Decide if the distribution of weight is nearly normal or
not. Create an appropriate plot to explain your reasoning.
Answer: Simulation based Inference is necessary, as the distribution of weight does not follow the linear theorethical quantiles, thus it is non-normal. One can see that for higher values the data follows the center line, however towards the lower values it tends to be quite far away from a normal distribution.
ggplot(ncbirths_white, aes(sample = weight)) +
stat_qq()+stat_qq_line()Let’s discuss how a test of our assumption about the weight of Caucasian babies would work. First we have to specify the null and alternative hypotheses.
Exercise 4: State the null and alternative hypotheses to test: whether the average weight of Caucasian babies has changed since 1995.
Answer:
\(H_0:\) Average weight of caucasian babies has not changed since 1995 (mu = 7.43)
\(H_1:\) Average weight of caucasian babies has changed since 1995 ()
The next step is now to simulate a null distribution of 1000 sample means that is centred at the null value of the average weight. But before we can do that, we have to set a seed.
Exercise 5: Set the seed equal to 123456 (arbitrary choice).
Answer:
set.seed(123456)Exercise 6: Run the appropriate simulation based
hypothesis test at a significance level of 0.05 using functions from the
infer package. Simulate the null distribution based on 1000
sample means, visualize the null distribution, calculate the p-value,
and interpret the results in context of the data and the hypothesis
test.
Answer: We observe a p value of 0. At alpha = 0.05 one can say that due to the smaller p- value, we have evidence against the null hypothesis and therefore reject the hypothesis. ___
null_dist <- ncbirths_white %>%
specify(response = weight) %>%
hypothesize(null = "point", mu = 7.43) %>%
generate(reps=1000,type="bootstrap") %>%
calculate(stat = "mean")Now we can visualize the null distribution
visualize(null_dist)shade_p_value(obs_stat = 7.250462,direction = "less")## A p-value shading layer.
To compute the p-value, ___
null_dist %>%
get_pvalue(obs_stat = 7.250462,direction="less")## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
Exercise 7: Make side-by-side boxplots displaying
the relationship between habit and weight (for
the whole dataset). What does the plot highlight about the relationship
between these two variables?
Answer:
The plot highlights the difference in average weight between non-smoking and smoking mothers. One can clearly see, that babies of non-smoking mothers weigh more on average than babies of smoking mothers. The mean of smoking mothers’ babies is about 6.9 pounds and the mean of non-smokig mothers’ babies is about 7.3 pounds. The average weight of babies whose mothers did not give an answer lies at 3.6 pounds. One can also see that overall, outliers tend to have a lower weight value compared to only two outliers that show a rather high weight.
ggplot(ncbirths, aes(x=weight,y=habit)) +
geom_boxplot()Exercise 8: Before moving forward, save a version of
the dataset omitting observations where there are NAs for
habit. You can call this version
ncbirths_habitgiven.
Answer:
ncbirths_habitgiven <- ncbirths %>%
filter(habit == "smoker" | habit == "nonsmoker")The box plots show how the medians of the two distributions compare,
but we can also compare the means of the distributions using the
following to first group the data by the habit variable,
and then calculate the mean weight in these groups
using.
ncbirths_habitgiven %>%
group_by(habit) %>%
summarise(mean_weight = mean(weight))## # A tibble: 2 × 2
## habit mean_weight
## <fct> <dbl>
## 1 nonsmoker 7.14
## 2 smoker 6.83
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test .
Exercise 9: Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
Answer:
\(H_0\): The average weight of babaies born to smoking and non-smoking mothers equals mean(habit=smoking) = mean (habit=smoking)
\(H_1\): The average weight of babies born to smoking and non-smoking mothers differs mean(habit=smoking) ≠ mean (habit=smoking)
Exercise 10: Run a simulation based hypothesis test
at a significance level of 0.05 using functions from the
infer package to test the null against the alternative
hypothesis from Exercise 9. Simulate the null distribution based on 1000
resamples and then calculate the p-value, and interpret the results in
context of the data and the hypothesis test.
Hint: Think about a suitable test statistic. Remember, that instead of assuming the mean values in both groups being the same, we could also assume that their difference is zero.
Answer:
null_dist_habit <- ncbirths_habitgiven %>%
specify(formula = weight ~ habit) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("smoker","nonsmoker"))To calculate the p-value, we need again the value of the test statistic observed for the given dataset.___
obs_diff <- ncbirths_habitgiven %>%
specify(formula = weight ~ habit) %>%
calculate(stat = "diff in means", order = c("smoker","nonsmoker"))
obs_diff## Response: weight (numeric)
## Explanatory: habit (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -0.316
Now we are able to compute the p-value
null_dist_habit %>%
get_pvalue(obs_diff, direction = "both")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.028
We observe a p-value of 0.04 which is smaller than our alpha-value of 0.05. Therefore we can reject the null hypothesis, meaning that the average weight of babies born to smokers does indeed differ from the average weight of babies born to non-smokers. ___
In this portion of the analysis we focus on two variables (from the
complete dataset). The first one is mature, which has two
levels "mature mom" and "younger mom".
Exercise 11: First, a non-inference task: Determine
the age cutoff for younger and mature mothers. Use the
quantile() function to determine the age, such that the
correct amount of younger mothers has an age being less or equal than
that value.
Answer:
Let’s first find out how many of 1000 mothers are younger moms. ___
ncbirths %>%
summarise(sum = sum(mature == "younger mom"))## # A tibble: 1 × 1
## sum
## <int>
## 1 867
This means, we are interested in the 0.867 quantile of the variable
mage, which contains the actual age of the mothers.
quantile(ncbirths$mage, probs=0.867)## 86.7%
## 34.133
The other variable of interest is lowbirthweight.
Exercise 12: Conduct a hypothesis test evaluating whether the proportion of low birth weight babies is higher for mature mothers. State the hypotheses, run the simulation based test for 1000 resamples and calculate the p-value, and state your conclusion in context of the research question. Use \(\alpha = 0.05\). If you find a significant difference, construct a confidence interval, at the equivalent level to the hypothesis test, for the difference between the proportions of low birth weight babies between mature and younger mothers, and interpret this interval in context of the data.
Answer:
\(H_0:\) The proportion of low birth weight babies from younger mothers and mature mothers does not differ
\(H_1:\) The proportion of low birth weight babies from is higher for mature mothers than for younger mothers.
null_dist_bweight <- ncbirths %>%
specify(success = "low", formula = lowbirthweight ~ mature)%>%
hypothesize(null = "independence")%>%
generate(reps = 1000, type = "permute")%>%
calculate(stat = "diff in props", order = c("mature mom", "younger mom"))The observed test statistic value is equal to
obs_diff_prop <- ncbirths %>%
specify(success = "low",formula = lowbirthweight ~ mature)%>%
calculate(stat = "diff in props", order = c("mature mom", "younger mom"))
obs_diff_prop## Response: lowbirthweight (factor)
## Explanatory: mature (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 0.0281
and the resulting p-value is equal to
null_dist_bweight %>%
get_pvalue(obs_diff_prop, direction = "greater")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.23
We observe a p-value of 0.18 which higher than our alpha-value of 0.05. Therefore we can accept the null hypothesis as true, which means that the proportion of low weight babies born to younger mothers does not differ from the proportion of low weight babies born to mature mothers.