In 2004, the state of North Carolina released a large data set containing information on births recorded in the state. This data set is useful to researchers studying the relationship between the habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set. Many of the commands in this lab are repeats from previous labs, so you may need to look back at old labs to refresh your memory.
Load the nc data set.
download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows.
| variable | description |
|---|---|
| fage | father’s age in years. |
| mage | mother’s age in years. |
| mature | maturity status of mother. |
| weeks | length of pregnancy in weeks. |
| premie | whether the birth was classified as premature (premie) or full-term. |
| visits | number of hospital visits during pregnancy. |
| marital | whether mother is married or not married at birth. |
| gained | weight gained by mother during pregnancy in pounds. |
| weight | weight of the baby at birth in pounds. |
| lowbirthweight | whether baby was classified as low birthweight (low) or not (not low). |
| gender | gender of the baby, female or male. |
| habit | status of the mother as a nonsmoker or a smoker. |
| whitemom | whether mom is white or not white. |
Now do Exercise 1.
Consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
Now do Exercise 2.
The box plots show how the medians of the two distributions compare,
but we can also compare the means. The by() function below splits the
weight variable into the habit groups and applies the mean function to
each group; the na.rm = TRUE argument tells mean to ignore missing
values.
by(nc$weight, nc$habit, mean, na.rm = TRUE)
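To see what by() is doing, here is a minimal sketch on a small made-up data frame. The name toy and all of its values are invented for illustration and are not taken from the nc data. The sketch also shows tapply(), which performs the same split-and-apply step but returns a plain named vector that is easier to do arithmetic with.

```r
# Hypothetical toy data; these values are invented for illustration only.
toy <- data.frame(
  weight = c(7.1, 6.8, NA, 7.5, 6.2, 6.6),
  habit  = factor(c("nonsmoker", "nonsmoker", "nonsmoker",
                    "nonsmoker", "smoker", "smoker"))
)

# by() splits toy$weight by toy$habit and applies mean() to each piece.
# na.rm = TRUE is passed through to mean(), so the NA is dropped first.
group_means <- by(toy$weight, toy$habit, mean, na.rm = TRUE)
print(group_means)

# tapply() does the same split-and-apply and returns a named vector.
means_vec <- tapply(toy$weight, toy$habit, mean, na.rm = TRUE)

# Observed difference in sample means (nonsmoker minus smoker).
obs_diff <- means_vec[["nonsmoker"]] - means_vec[["smoker"]]
print(obs_diff)
```

Without na.rm = TRUE, the nonsmoker mean in this toy example would be NA, because mean() returns NA whenever any value in the group is missing.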
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
Now do Exercise 3.
Next, we use R’s built-in t.test() function to conduct a
hypothesis test for the difference in mean birth weight between the two
habit groups. Enter the following command in a code chunk as part of
Exercise 4.
t.test(weight ~ habit, data = nc, alternative = "two.sided")
Let’s walk through the arguments of t.test(). The first
argument uses R’s formula syntax: weight ~ habit means
“compare weight across the groups defined by
habit.” The data = nc argument tells R which
data frame to use. The alternative argument specifies the
direction of the alternative hypothesis; here "two.sided"
means we are testing for any difference in either direction. The output
includes the test statistic, degrees of freedom, p-value, and a 95%
confidence interval for the difference in means.
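The pieces of the output described above can also be pulled out of the result one at a time. The sketch below uses a small made-up data frame (toy, with invented values, not the nc data) so that it runs on its own. It relies on the fact that t.test() returns a list whose components include statistic, parameter, p.value, and estimate.

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  weight = c(7.1, 6.8, 7.4, 7.5, 7.0, 6.2, 6.6, 6.0, 6.9, 6.4),
  habit  = factor(rep(c("nonsmoker", "smoker"), each = 5))
)

# Same call structure as in the lab: response ~ grouping variable.
result <- t.test(weight ~ habit, data = toy, alternative = "two.sided")

# The printed output is built from these components of the result.
result$statistic   # t test statistic
result$parameter   # degrees of freedom
result$p.value     # two-sided p-value
result$estimate    # sample mean of each group
result$method      # name of the test that was run
```

By default t.test() does not assume equal variances in the two groups (it runs Welch's t-test), which is why the degrees of freedom in the output are usually not a whole number.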
The commented scaffold in your template shows the structure of the
call. Remove the comment characters (#) and fill in the
variable names to run the test.
Now do Exercise 4.
Now do Exercise 5.
Notice that t.test() produces both a p-value and a
confidence interval in the same output. In Exercise 5 you are asked to
read off and interpret the confidence interval that already appears in
the Exercise 4 output. You do not need to run a separate command.
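If you want to look at the interval by itself, it is stored in the conf.int component of the t.test() result. The sketch below again uses a made-up data frame (toy, with invented values) rather than the nc data. With the formula interface, the interval is for the mean of the first factor level minus the mean of the second, so here it is nonsmoker minus smoker.

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  weight = c(7.1, 6.8, 7.4, 7.5, 7.0, 6.2, 6.6, 6.0, 6.9, 6.4),
  habit  = factor(rep(c("nonsmoker", "smoker"), each = 5))
)

result <- t.test(weight ~ habit, data = toy, alternative = "two.sided")

# The interval is stored with its confidence level as an attribute.
ci <- result$conf.int
print(ci)
attr(ci, "conf.level")   # 0.95 unless the conf.level argument is changed

# For a two-sided test, the 95% interval excludes 0 exactly when p < 0.05,
# so the interval and the p-value always lead to the same decision.
excludes_zero <- ci[1] > 0 || ci[2] < 0
print(excludes_zero == (result$p.value < 0.05))
```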
Now do Exercise 6.
For Exercise 6 we test whether the average weight gained during pregnancy differs between younger and mature mothers. The structure of the call is identical to Exercise 4, but with different variable names. Use the commented scaffold in the template as your starting point.
So far in class, we have performed hypothesis tests for one
population or for two populations, but sometimes we may be interested in
comparing more than two populations. Consider the variable
mature. It is a categorical variable with two categories:
mature mom and younger mom. Suppose, however,
that we are interested in introducing a third category
teen mom. To this end, we introduce a new column to the
dataset nc titled mature.new. The code is
shown below. Put it in a code chunk in your notebook as part of Exercise
7.
nc$mature.new <- as.character(nc$mature)
nc$mature.new[nc$mage < 20] <- "teen mom"
nc$mature.new <- as.factor(nc$mature.new)
Note that mature.new is a new variable; the original mature column is
left unchanged.
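The three-step pattern above (factor to text, overwrite some labels, text back to factor) can be traced on a small example. The data frame toy and its values are invented for illustration. Only the column names mage, mature, and mature.new mirror the lab.

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  mage   = c(17, 19, 24, 31, 36, 40),
  mature = factor(c("younger mom", "younger mom", "younger mom",
                    "younger mom", "mature mom", "mature mom"))
)

# Step 1: convert the factor to plain text so a new label can be assigned.
# (Assigning a label that is not an existing level of a factor gives NA
# with a warning, which is why we do not modify the factor directly.)
toy$mature.new <- as.character(toy$mature)

# Step 2: overwrite the label for mothers younger than 20.
toy$mature.new[toy$mage < 20] <- "teen mom"

# Step 3: convert back to a factor, which now has three levels.
toy$mature.new <- as.factor(toy$mature.new)

levels(toy$mature)       # the original column still has two levels
levels(toy$mature.new)   # the new column has three
table(toy$mature.new)    # number of rows in each category
```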
Now do Exercise 7.
While side-by-side boxplots may indicate a relationship between two variables, we need to perform a hypothesis test to determine whether this relationship is significant or merely a consequence of random sampling. But how do we perform a hypothesis test with three populations? It may be helpful to start by thinking about what our hypotheses should be. If we let \(\mu_m\), \(\mu_y\), and \(\mu_t\) denote the true means of the mature, younger, and teen mom populations, respectively, then we can state the hypotheses as,
\[\begin{aligned} H_0: & \quad \mu_m = \mu_y = \mu_t \\ H_a: & \quad \text{at least one of the means differs from the others.} \end{aligned}\]
It may seem intuitive to perform 3 hypothesis tests: teen mom vs. younger mom, teen mom vs. mature mom, and mature mom vs. younger mom. The code for the teen mom vs. younger mom hypothesis test is shown below.
nc.ty <- subset(nc, mature.new %in% c("teen mom", "younger mom"))
nc.ty$mature.new <- droplevels(nc.ty$mature.new)
t.test(weight ~ mature.new, data = nc.ty, alternative = "two.sided")
This code is a bit involved. Let’s walk through it. First, we create
a new data frame called nc.ty (the “t” is for “teen mom”
and the “y” is for “younger mom”). We use subset() to keep
only the rows for teen and younger moms, leaving out the mature mom
data. Next, droplevels() removes any unused factor levels —
in this case it removes the “mature mom” level since we are no longer
using it. Finally, t.test() conducts a hypothesis test for
the difference in mean birth weights between infants of teen mothers and
infants of younger mothers.
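To see what subset() and droplevels() each contribute, the sketch below applies them to a made-up data frame (toy, with invented values) and checks the factor levels before and after. The names toy.ty, levels_before, and levels_after are chosen here for illustration.

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  weight     = c(6.5, 6.9, 7.2, 7.4, 7.0, 7.6),
  mature.new = factor(c("teen mom", "teen mom", "younger mom",
                        "younger mom", "mature mom", "mature mom"))
)

# Keep only the teen and younger rows.
toy.ty <- subset(toy, mature.new %in% c("teen mom", "younger mom"))

# The rows are gone, but the factor still remembers the unused level,
# so table() reports a count of 0 for "mature mom".
levels_before <- levels(toy.ty$mature.new)
table(toy.ty$mature.new)

# droplevels() discards levels that no longer occur in the data.
toy.ty$mature.new <- droplevels(toy.ty$mature.new)
levels_after <- levels(toy.ty$mature.new)
table(toy.ty$mature.new)
```

Dropping the unused level keeps later tables and plots of the subset from showing an empty "mature mom" group.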
Now do Exercise 8.
There is a problem with performing 3 separate hypothesis tests. Each test carries its own chance of a Type I error, and those chances accumulate. Suppose \(H_0\) is true, so that all three means are equal. When \(\alpha = 0.05\), the probability of not making a Type I error on a single test is 0.95. To correctly conclude that all three means are equal, all three tests must avoid a Type I error simultaneously. If the tests were independent, that probability would be
\[ (0.95)(0.95)(0.95) = 0.8574, \]
which is substantially lower than 95%. Put another way, the chance of at least one Type I error rises from 0.05 to about \(1 - 0.8574 = 0.1426\). The three tests are not exactly independent here, since they share data, but the calculation gives the right intuition: performing multiple tests inflates the overall Type I error rate and makes us more likely to reject \(H_0\) even when it is true.
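The same arithmetic can be done in R, which makes it easy to see how quickly the problem grows with more groups. This is a sketch under the simplifying independence assumption used above. The helper name familywise_error is made up for this illustration and is not a built-in R function.

```r
# Hypothetical helper: chance of at least one Type I error across k tests,
# assuming the tests are independent and each uses significance level alpha.
familywise_error <- function(alpha, k) {
  1 - (1 - alpha)^k
}

alpha <- 0.05
k3 <- choose(3, 2)   # number of pairwise tests among 3 groups
k4 <- choose(4, 2)   # number of pairwise tests among 4 groups

# Probability that none of the 3 tests makes a Type I error.
p_no_error <- (1 - alpha)^k3
round(p_no_error, 4)

# Probability of at least one Type I error with 3 groups, then 4 groups.
round(familywise_error(alpha, k3), 4)
round(familywise_error(alpha, k4), 4)
```

Because the number of pairwise comparisons grows faster than the number of groups, the inflation gets worse quickly as groups are added.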
Thankfully, there is a specific test designed for exactly this scenario. It is called ANOVA, which stands for Analysis of Variance. The inner workings of ANOVA are a bit beyond the scope of this course. At its core, it is a hypothesis test that lets us compare all group means simultaneously, controlling the Type I error rate at the chosen \(\alpha\). The code below performs ANOVA for the three populations of interest. Include it in a code chunk as part of Exercise 9.
my.anova <- aov(weight ~ mature.new, data = nc)
summary(my.anova)
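The summary table printed by summary() holds the F statistic and the p-value for the single test of \(H_0\). The sketch below runs the same two steps on a made-up data frame (toy, with invented values, not the nc data) and then pulls those two numbers out of the table. The names toy.anova, anova_table, f_stat, and p_value are chosen here for illustration.

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  weight     = c(6.5, 6.9, 6.7, 7.2, 7.4, 7.1, 7.0, 7.6, 7.3),
  mature.new = factor(rep(c("teen mom", "younger mom", "mature mom"),
                          each = 3))
)

# One test of H0: all three group means are equal.
toy.anova <- aov(weight ~ mature.new, data = toy)

# summary() returns a list; its first element is the ANOVA table.
anova_table <- summary(toy.anova)[[1]]
print(anova_table)

# The first row of the table belongs to the grouping variable,
# the second row to the residuals.
f_stat  <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
print(c(F = f_stat, p = p_value))
```

With three groups and nine observations, the table shows 2 degrees of freedom for mature.new (number of groups minus one) and 6 for the residuals (number of observations minus number of groups).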
Now do Exercise 9.
This lab is a modification of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel. It has been edited extensively by Ben Jackson.