In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
We’ll use the tidyverse package for much of the data wrangling and visualization, and the data lives in the openintro package.
library(tidyverse)
library(openintro)
The dataset is called ncbirths, and since it is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the openintro package. However, if we have trouble installing the openintro package, the dataset is also available on Canvas in the file ncbirths.csv and can be loaded into R using the read_csv function that we have seen in previous labs.
You can find out more about the dataset by inspecting its documentation, which you can access by running ?ncbirths in the Console or using the Help menu in RStudio to search for ncbirths.
The first step in the analysis of a new dataset is getting acquainted with the data.
# Use commands for exploring data that were introduced in the previous labs to
# answer the questions below
What are the observational units in this data set? What is the sample size?
How many variables are there? How many are qualitative and how many are quantitative?
When we are first exploring data we also want to know if there are any outliers. While we can sometimes do this by looking at summary values for each variable (e.g., min, median, mean, max), it can be easier to do visually.
To create one figure with multiple graphs shown at once, we will need to use the gridExtra package in R, which we will need to install prior to running the next bit of code.
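As a small illustration of the summary-value approach, the sketch below uses a made-up vector rather than the ncbirths data. The name toy_weights and its values are hypothetical; the point is only to show how one extreme value shows up in summary() and in the rule a boxplot uses to flag points.

```r
# Hypothetical toy data, NOT taken from ncbirths: nine plausible birth
# weights in pounds plus one extreme value of the kind a typo might produce.
toy_weights <- c(6.8, 7.1, 7.4, 7.0, 6.5, 7.9, 8.1, 7.3, 6.9, 17.2)

# Numerical check: the maximum sits far above the median, and the mean is
# pulled upward relative to the median.
summary(toy_weights)

# boxplot.stats() returns the values a boxplot would draw as individual
# points (more than 1.5 * IQR beyond the hinges).
flagged <- boxplot.stats(toy_weights)$out
flagged
```

With many variables, a plot lets us spot such points at a glance, which is why the scatter plots are worth making.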
NOTE: Delete eval=FALSE prior to knitting the document
# Using ggplot2, create three scatter plots:
# 1. Age of mother (mage) against age of father (fage)
# 2. Mother's weight gain (gained) against birth weight (weight)
# 3. Weeks of gestation (weeks) against birth weight (weight)
# For each plot, you will need to:
# Define the dataset we want to work with (ncbirths)
# Define the aesthetics for the plot based on the 3 plots defined above
# Provide an informative title and axis labels
library(gridExtra)
plot1 <- ggplot() +
  geom_point() +
  labs()
plot2 <- ggplot() +
  geom_point() +
  labs()
plot3 <- ggplot() +
  geom_point() +
  labs()
grid.arrange(plot1, plot2, plot3, ncol = 2)
A 1995 study suggests that the average weight of Caucasian babies born in the US is 3,369 grams (7.43 pounds) [1]. In this dataset we only have information on the mother’s race, so we will make the simplifying assumption that babies of Caucasian mothers (those with whitemom equal to "white") are also Caucasian.
We want to evaluate whether the average weight of Caucasian babies has changed since 1995.
Our null hypothesis should state “there is nothing going on”, i.e. no change since 1995: \(H_0: \mu = 7.43~\text{pounds}\).
Our alternative hypothesis should reflect the research question, i.e. some change since 1995. Since the research question doesn’t state a direction for the change, we use a two-sided alternative hypothesis: \(H_A: \mu \ne 7.43~\text{pounds}\).
NOTE: Delete eval=FALSE before knitting the document
# Do the following two tasks using R code:
# 1. Filter ncbirths by race (whitemom) to include only data from white mothers, and
# assign it to the object ncbirths_white
# 2. Summarize the weights (weight) of Caucasian babies using the mean
ncbirths_white <- ncbirths %>%
  filter()
ncbirths_white %>%
  summarize()
While it is possible to take a simulation-based approach to doing a hypothesis test for a mean, we are going to focus on the theory-based approach, the t-test, in this lab. In R we can do the t-test with the function t.test.
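To make the theory-based approach concrete, here is a minimal sketch of what t.test computes for a one-sample test: the standardized statistic \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\) and its two-sided p-value from a t distribution with \(n - 1\) degrees of freedom. The data are simulated, not the ncbirths sample, and the names x and mu0 are chosen only for illustration.

```r
# Simulated stand-in for a sample of weights (NOT the ncbirths data).
set.seed(1)
x   <- rnorm(50, mean = 7.2, sd = 1.5)
mu0 <- 7.43                                     # hypothesized mean under H0

n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # standardized statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)      # two-sided p-value

# t.test() reports the same two quantities.
result <- t.test(x, mu = mu0)
c(by_hand = t_stat, t_test = unname(result$statistic))
c(by_hand = p_val,  t_test = result$p.value)
```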
However, before we perform any statistical analyses we want to stop and think about whether the validity conditions (i.e., assumptions) for the test have been met. R, or any other program we use to do analyses, will calculate standardized statistics and p-values based on the numbers we give it, but it can’t tell us whether the test was appropriate to use on our data. That is up to us to determine before we do the statistical analysis.
One way to check if our data have met the assumption of being normally distributed is to plot a histogram of the data.
NOTE: Delete eval=FALSE before knitting the document
# In this code chunk, do the following:
# Define the dataset we want to work with (ncbirths_white)
# Define the aesthetic (weight) we want to plot for the histogram
# Add the geom to create a histogram
# Add an informative title and x-axis label
ggplot() +
Whenever we manipulate data, such as filtering the data set to only include white mothers, we always want to take a look at the result to make sure that our manipulation did what we expected and to get a sense of the structure of the new data set that we are working with.
# Use one of the commands we have seen in previous labs to take a quick glimpse of the data
How many white mothers are in our data set?
Looking at the plot and our sample size, have the validity conditions to do a t-test been met? Explain.
Now that we have assessed our validity conditions, we can move forward with performing the t-test. As with the other functions we have seen in R so far, t.test has inputs it requires in order to run, some of which we have to provide and some of which have defaults. For now, we only care about the following four inputs to t.test, although it does have other arguments.
x – our data, in vector form. This is the only required input.
alternative – whether we want to do a one- or two-tailed test. Defaults to "two.sided"; the other options are "less" and "greater". The value must be given in quotation marks.
mu – the hypothesized true value of the mean (\(\mu\)).
conf.level – the confidence level used to build the confidence interval. Defaults to 0.95.
In this case, we want to investigate whether the average weight of white babies has changed since 1995. Therefore, the data required by t.test is an object containing just the baby weights that are recorded in the variable weight. We therefore need to select this variable from our data set, ncbirths_white.
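The sketch below shows how these four inputs fit together in a call, using simulated data in place of the lab's data set. The objects toy_df and toy_sample are hypothetical, and extracting the column with $ is just one base R way to get a numeric vector.

```r
# Hypothetical data frame with a single numeric column (NOT ncbirths_white).
set.seed(2)
toy_df <- data.frame(weight = rnorm(40, mean = 7.3, sd = 1.4))

# t.test() wants a numeric vector for x, so extract the column first.
toy_sample <- toy_df$weight

# All four inputs written out explicitly...
res_explicit <- t.test(x = toy_sample, alternative = "two.sided",
                       mu = 7.43, conf.level = 0.95)

# ...and the same test relying on the defaults for alternative and conf.level.
res_default <- t.test(x = toy_sample, mu = 7.43)

res_explicit$p.value
res_default$p.value
res_explicit$conf.int
```

Both calls run the same test, since "two.sided" and 0.95 are the defaults; only mu has to be supplied here because its default is 0.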
NOTE: Delete eval=FALSE before knitting the document
# Do the following two tasks using R code:
# 1. Select the variable weight from the ncbirths_white data set and assign it to the
# object baby_weight
# 2. Perform a t-test to evaluate whether the average weight of Caucasian babies has
# changed since 1995
baby_weight <- ncbirths_white %>%

t.test(x = , mu = )
Using a significance level (i.e. \(\alpha\)-level) of 0.05, and based on the results of the t-test, is it plausible that the average weight of Caucasian babies in 2004 is the same as it was in 1995?
What is our confidence interval?
When we do a t-test, we build a confidence interval around our observed mean \(\bar{x}\) with a confidence level of \(1 - \alpha\). So at a significance level of 0.05, we would build a 95% confidence interval. If we decide to use a different \(\alpha\)-level, then we need to build a confidence interval with a different confidence level as well.
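As a hedged illustration of how the confidence level enters the calculation, the sketch below builds the interval \(\bar{x} \pm t^{*} \cdot s / \sqrt{n}\) by hand, where \(t^{*}\) is the \(1 - \alpha/2\) quantile of a t distribution with \(n - 1\) degrees of freedom. The data are simulated and the helper ci_by_hand is a hypothetical function written for this example, not part of any package.

```r
# Simulated stand-in for a sample of weights (NOT the ncbirths data).
set.seed(3)
x <- rnorm(60, mean = 7.3, sd = 1.5)

# Hypothetical helper: t-based confidence interval for a mean.
ci_by_hand <- function(x, conf_level) {
  alpha  <- 1 - conf_level
  t_star <- qt(1 - alpha / 2, df = length(x) - 1)   # critical value
  mean(x) + c(-1, 1) * t_star * sd(x) / sqrt(length(x))
}

ci95 <- ci_by_hand(x, 0.95)
ci90 <- ci_by_hand(x, 0.90)

# The interval reported by t.test() for the same confidence level.
ci90_ttest <- as.numeric(t.test(x, mu = 7.43, conf.level = 0.90)$conf.int)

ci95
ci90
ci90_ttest
```

Lowering the confidence level shrinks \(t^{*}\), so the 90% interval is narrower than the 95% interval around the same \(\bar{x}\).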
# Perform the same t-test as before, but now with a confidence level of 0.90
Does our decision regarding our null hypothesis change when we change the significance level?
What is our new confidence interval?
Our choice of alternative hypothesis can also affect our decision regarding the null hypothesis.
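The sketch below, on simulated data rather than the lab's sample, shows how the p-value reported by t.test depends on the alternative argument. For a t-test the two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them.

```r
# Simulated stand-in for a sample of weights (NOT the ncbirths data).
set.seed(4)
x <- rnorm(45, mean = 7.2, sd = 1.5)

p_two     <- t.test(x, mu = 7.43, alternative = "two.sided")$p.value
p_greater <- t.test(x, mu = 7.43, alternative = "greater")$p.value
p_less    <- t.test(x, mu = 7.43, alternative = "less")$p.value

# Same data, same hypothesized mean, three different p-values.
c(two.sided = p_two, greater = p_greater, less = p_less)
```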
# Perform the same t-test as before, but now with an alternative hypothesis that mu
# is greater than our hypothesized value
How does our decision regarding our null hypothesis change when we change the alternative hypothesis?
Based on your answer to Question 10, is it now possible to say that we accept the null? Why or why not?
[1] Wen, Shi Wu, Michael S. Kramer, and Robert H. Usher. “Comparison of birth weight distributions between Chinese and Caucasian infants.” American Journal of Epidemiology 141.12 (1995): 1177-1187.