North Carolina births

In 2004, the state of North Carolina released a large data set containing information on births recorded in the state. This data set is useful to researchers studying the relationship between the habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set. Many of the commands in this lab are repeats from previous labs, so you may need to look back at old labs to refresh your memory.

Exploratory analysis

Load the nc data set.

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")

We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows.

variable	description
`fage`	father’s age in years.
`mage`	mother’s age in years.
`mature`	maturity status of mother.
`weeks`	length of pregnancy in weeks.
`premie`	whether the birth was classified as premature (premie) or full-term.
`visits`	number of hospital visits during pregnancy.
`marital`	whether mother is `married` or `not married` at birth.
`gained`	weight gained by mother during pregnancy in pounds.
`weight`	weight of the baby at birth in pounds.
`lowbirthweight`	whether baby was classified as low birthweight (`low`) or not (`not low`).
`gender`	gender of the baby, `female` or `male`.
`habit`	status of the mother as a `nonsmoker` or a `smoker`.
`whitemom`	whether mom is `white` or `not white`.
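
Before starting the exercises, it can help to check the size of a data frame and the type of each variable. The sketch below uses a small made-up data frame called toy (the name and all values are invented for illustration and are not taken from nc). The same calls should work on nc once it is loaded.

```r
# Hypothetical toy data frame; values are made up, not drawn from nc.
toy <- data.frame(
  mage   = c(19, 24, 31, 36),
  weight = c(6.8, 7.4, 7.1, 8.0),
  habit  = factor(c("smoker", "nonsmoker", "nonsmoker", "smoker"))
)

dim(toy)            # number of rows (cases) and columns (variables)
names(toy)          # variable names
str(toy)            # type of each variable: num is numerical, Factor is categorical
sapply(toy, class)  # class of each column as a named character vector
```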

Now do Exercise 1.

Consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

Now do Exercise 2.

The box plots show how the medians of the two distributions compare. We can also compare the means: the by() function below splits the weight variable into the habit groups and then applies the mean function to each group. The na.rm = TRUE argument tells mean to ignore missing values.

by(nc$weight, nc$habit, mean, na.rm = TRUE)
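
To see what by() is doing, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical). It computes the group means with by() and again with tapply(), which gives the same numbers as a named vector, and then takes their difference.

```r
# Hypothetical toy data; the NA shows why na.rm = TRUE is needed.
toy <- data.frame(
  weight = c(7.0, 8.0, NA, 6.0, 7.0),
  habit  = factor(c("nonsmoker", "nonsmoker", "nonsmoker", "smoker", "smoker"))
)

# Split weight by habit and apply mean() to each piece.
by(toy$weight, toy$habit, mean, na.rm = TRUE)

# tapply() does the same split-and-apply but returns a named vector.
group_means <- tapply(toy$weight, toy$habit, mean, na.rm = TRUE)
group_means

# Observed difference in sample means (nonsmoker minus smoker).
obs_diff <- group_means[["nonsmoker"]] - group_means[["smoker"]]
obs_diff
```

Storing the means in a vector makes it easy to compute the observed difference directly, rather than reading the two numbers off the printed output.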

There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.

Inference

Now do Exercise 3.

Next, we use R’s built-in t.test() function to conduct a hypothesis test for the difference in mean birth weight between the two habit groups. Enter the following command in a code chunk as part of Exercise 4.

t.test(weight ~ habit, data = nc, alternative = "two.sided")

Let’s walk through the arguments of t.test(). The first argument uses R’s formula syntax: weight ~ habit means “compare weight across the groups defined by habit.” The data = nc argument tells R which data frame to use. The alternative argument specifies the direction of the alternative hypothesis; here "two.sided" means we are testing for any difference in either direction. The output includes the test statistic, degrees of freedom, p-value, and a 95% confidence interval for the difference in means.
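
The pieces of the output described above can also be pulled out of the result by name. The following is a sketch on simulated data (the data frame sim, the seed, and the group means are assumptions chosen for illustration, not values from nc). Note that t.test() uses the Welch version of the test by default, so the degrees of freedom are usually not a whole number.

```r
# Simulated data for illustration only; not the nc sample.
set.seed(1)
sim <- data.frame(
  weight = c(rnorm(40, mean = 7.2, sd = 1.5), rnorm(40, mean = 6.8, sd = 1.5)),
  habit  = factor(rep(c("nonsmoker", "smoker"), each = 40))
)

res <- t.test(weight ~ habit, data = sim, alternative = "two.sided")

res$statistic  # t test statistic
res$parameter  # degrees of freedom (Welch approximation)
res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval for the difference in means
res$estimate   # sample mean of each group
```

With the formula syntax, the difference is taken as the first factor level minus the second, here nonsmoker minus smoker.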

The commented scaffold in your template shows the structure of the call. Remove the comment characters (#) and fill in the variable names to run the test.

Now do Exercise 4.

Now do Exercise 5.

Notice that t.test() produces both a p-value and a confidence interval in the same output. In Exercise 5 you are asked to read off and interpret the confidence interval that already appears in the Exercise 4 output. You do not need to run a separate command.
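
To make the link between the confidence interval and the p-value concrete, the sketch below rebuilds the interval by hand and compares it with the one reported by t.test(). The vectors x and y are simulated stand-ins for two groups (hypothetical values, not the nc sample), and the degrees of freedom are reused from the t.test() result rather than recomputed.

```r
# Simulated data for illustration only; not the nc sample.
set.seed(2)
x <- rnorm(30, mean = 7.2, sd = 1.4)  # hypothetical group 1
y <- rnorm(35, mean = 6.9, sd = 1.6)  # hypothetical group 2

res <- t.test(x, y)  # Welch two-sample t-test, 95% confidence level by default

# Rebuild the interval: (difference in sample means) +/- t* times (standard error)
se     <- sqrt(var(x) / length(x) + var(y) / length(y))
t_star <- qt(0.975, df = res$parameter)  # reuse the degrees of freedom from t.test()
manual <- (mean(x) - mean(y)) + c(-1, 1) * t_star * se

manual
res$conf.int

# The 95% interval excludes 0 exactly when the two-sided p-value is below 0.05.
excludes_zero <- manual[1] > 0 || manual[2] < 0
excludes_zero == (res$p.value < 0.05)
```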

Now do Exercise 6.

For Exercise 6 we test whether the average weight gained during pregnancy differs between younger and mature mothers. The structure of the call is identical to Exercise 4, but with different variable names. Use the commented scaffold in the template as your starting point.

ANOVA: Analysis of Variance

So far in class, we have performed hypothesis tests for one population or for two populations, but sometimes we may be interested in comparing more than two populations. Consider the variable mature. It is a categorical variable with two categories: mature mom and younger mom. Suppose, however, that we are interested in introducing a third category teen mom. To this end, we introduce a new column to the dataset nc titled mature.new. The code is shown below. Put it in a code chunk in your notebook as part of Exercise 7.

nc$mature.new <- as.character(nc$mature)
nc$mature.new[nc$mage < 20] <- "teen mom"
nc$mature.new <- as.factor(nc$mature.new)

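Note that mature.new is a new variable we are creating; the original mature column is left unchanged. The first line copies mature as plain text, the second relabels every mother younger than 20 as "teen mom", and the third converts the result back into a categorical variable (a factor).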
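
The same three steps can be tried on a tiny made-up data frame to see their effect (toy and its values are hypothetical). The conversion to text comes first because R does not allow a label that is not already one of a factor's levels to be assigned into that factor; doing so produces a missing value and a warning.

```r
# Hypothetical toy data; ages and labels are made up for illustration.
toy <- data.frame(
  mage   = c(18, 36, 25, 19),
  mature = factor(c("younger mom", "mature mom", "younger mom", "younger mom"))
)

toy$mature.new <- as.character(toy$mature)   # factor -> plain text labels
toy$mature.new[toy$mage < 20] <- "teen mom"  # relabel mothers under 20
toy$mature.new <- as.factor(toy$mature.new)  # text -> factor with three levels

levels(toy$mature.new)
table(toy$mature.new)
```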

Now do Exercise 7.

While side-by-side boxplots may indicate a relationship between two variables, we need to perform a hypothesis test to determine whether this relationship is significant or merely a consequence of random sampling. But how do we perform a hypothesis test with three populations? It may be helpful to start by thinking about what our hypotheses should be. If we let \(\mu_m\), \(\mu_y\), and \(\mu_t\) denote the true mean birth weights of babies born to mature, younger, and teen moms, respectively, then we can state the hypotheses as

\[\begin{align*} H_0: & \quad \mu_m = \mu_y = \mu_t \\ H_a: & \quad \text{at least one of the means differs from the others.} \end{align*}\]

It may seem intuitive to perform 3 hypothesis tests: teen mom vs. younger mom, teen mom vs. mature mom, and mature mom vs. younger mom. The code for the teen mom vs. younger mom hypothesis test is shown below.

nc.ty <- subset(nc, mature.new %in% c("teen mom", "younger mom"))
nc.ty$mature.new <- droplevels(nc.ty$mature.new)
t.test(weight ~ mature.new, data = nc.ty, alternative = "two.sided")

This code is a bit involved. Let’s walk through it. First, we create a new data frame called nc.ty (the “t” is for “teen mom” and the “y” is for “younger mom”). We use subset() to keep only the rows for teen and younger moms, leaving out the mature mom data. Next, droplevels() removes any unused factor levels — in this case it removes the “mature mom” level since we are no longer using it. Finally, t.test() conducts a hypothesis test for the difference in mean birth weights between infants of teen mothers and infants of younger mothers.
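
The effect of subset() and droplevels() is easiest to see on a small example. The sketch below uses a made-up data frame (toy, toy.ty, and all values are hypothetical) and prints the factor levels before and after the unused level is dropped.

```r
# Hypothetical toy data with three groups; values are made up.
toy <- data.frame(
  weight     = c(6.5, 7.0, 7.5, 8.0, 6.0, 7.2),
  mature.new = factor(c("teen mom", "younger mom", "mature mom",
                        "younger mom", "teen mom", "mature mom"))
)

toy.ty <- subset(toy, mature.new %in% c("teen mom", "younger mom"))

levels(toy.ty$mature.new)  # still lists "mature mom", although no rows use it
table(toy.ty$mature.new)   # "mature mom" appears with a count of 0

toy.ty$mature.new <- droplevels(toy.ty$mature.new)
levels(toy.ty$mature.new)  # now only "teen mom" and "younger mom"
```

Dropping the unused level keeps tables and summaries of the subset from showing an empty "mature mom" group.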

Now do Exercise 8.

There is a problem with performing 3 separate hypothesis tests. Each test carries its own chance of a Type I error, and those chances accumulate. When \(\alpha = 0.05\) and the null hypothesis is true, the probability of not making a Type I error on a single test is 0.95. If all three means really are equal, we reach the correct conclusion only when all three tests avoid a Type I error. Assuming the tests are independent, that probability is \[ (0.95)(0.95)(0.95) = 0.8574, \] which is substantially lower than 95%: the chance of at least one Type I error is \(1 - 0.8574 = 0.1426\), nearly three times \(\alpha\). The independence assumption is not exactly true here, since the three tests share data, but the calculation gives the right intuition: performing multiple tests inflates the overall Type I error rate and makes us more likely to reject \(H_0\) even when it is true.
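
The arithmetic above can be checked in R, and the same calculation shows how quickly the problem grows with more groups. This is a sketch under the same independence assumption used in the text; the variable names are made up for illustration.

```r
alpha   <- 0.05
n_tests <- 3

# Chance that none of the tests makes a Type I error (independence assumed).
p_no_error <- (1 - alpha)^n_tests
p_no_error

# Chance of at least one Type I error across the three tests.
p_any_error <- 1 - p_no_error
p_any_error

# The number of pairwise comparisons grows quickly with the number of groups.
groups <- 3:6
data.frame(groups         = groups,
           pairwise_tests = choose(groups, 2),
           p_any_error    = 1 - (1 - alpha)^choose(groups, 2))
```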

Thankfully, there is a specific test designed for exactly this scenario. It is called ANOVA, which stands for Analysis of Variance. The inner workings of ANOVA are a bit beyond the scope of this course. At its core, it is a hypothesis test that lets us compare all group means simultaneously, controlling the Type I error rate at the chosen \(\alpha\). The code below performs ANOVA for the three populations of interest. Include it in a code chunk as part of Exercise 9.

my.anova <- aov(weight ~ mature.new, data = nc)
summary(my.anova)
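
If you want to work with the numbers in the ANOVA table rather than just read them, they can be extracted from the summary. The sketch below uses simulated data (sim, the seed, and the group means are assumptions for illustration, not the nc sample) and pulls out the F statistic and p-value.

```r
# Simulated data for illustration only; not the nc sample.
set.seed(3)
sim <- data.frame(
  weight     = c(rnorm(30, 7.3, 1.5), rnorm(30, 7.1, 1.5), rnorm(30, 6.7, 1.5)),
  mature.new = factor(rep(c("mature mom", "younger mom", "teen mom"), each = 30))
)

fit <- aov(weight ~ mature.new, data = sim)
tab <- summary(fit)[[1]]  # the ANOVA table as a data frame

tab
f_stat  <- tab[["F value"]][1]  # F statistic for the grouping variable
p_value <- tab[["Pr(>F)"]][1]   # p-value for H0: all group means are equal
c(F = f_stat, p = p_value)
```

The first row of the table belongs to the grouping variable and the second to the residuals, which is why the first entry of each column is used.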

Now do Exercise 9.

Acknowledgements

This lab is a modification of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel. It has been edited extensively by Ben Jackson.

MATH 106: Introduction to Statistics

Lab 8: Hypothesis Testing and ANOVA
