Getting Started

`oilabs` package

If you have not already done so, install the oilabs package by typing install.packages("oilabs") in the console.

Do NOT include install.packages("oilabs") in your R notebook. It is bad practice, and it will cause an error when you knit to PDF.

Load packages

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package. Both are loaded as part of the tidyverse.

Let’s load the packages.

library(tidyverse)
library(here)
library(oilabs)

The data

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Load the nc data set.

nc <- read_csv(here('data', 'ncbirths.csv'))
head(nc)

We have observations on 12 different variables, some categorical and some numerical. The meaning of each variable can be found in the separate documentation file ncbirths_documentation.txt.

What are the cases/observations in this data set? How many cases/observations are there in our sample?

summary(nc)
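
The number of cases can also be read off directly from the dimensions of the data frame. A minimal sketch, using a small made-up data frame (`toy` and its values are illustrative assumptions, not the real data); the same calls should work on nc:

```r
library(dplyr)

# Hypothetical stand-in for nc; values are invented for illustration only.
toy <- tibble(
  habit  = c("nonsmoker", "smoker", "nonsmoker", NA),
  weight = c(7.6, 6.9, 8.1, 7.2),
  gained = c(30, NA, 25, 40)
)

n_cases <- nrow(toy)   # each row is one case
n_vars  <- ncol(toy)   # each column is one variable
glimpse(toy)           # prints the dimensions and a preview of every column
```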

Exploratory data analysis

We will start by analyzing the weight gained by mothers during pregnancy: gained.

Using visualization and summary statistics, describe the distribution of weight gained by mothers during pregnancy.

Answer: The distribution of weight gained by mothers during pregnancy is nearly normal. The following visualization shows a bell-shaped distribution that is not strongly skewed. The summary of ‘gained’ also shows that the mean and median are close to each other, which is consistent with a nearly symmetric distribution.

ggplot(data = nc) + geom_histogram(mapping = aes(x = gained), binwidth = 5)

summary(nc$gained)

How many mothers are we missing weight gain data for?

nc %>% summarise(n_missing = sum(is.na(gained)))

Next, consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
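
A side-by-side boxplot can be sketched with geom_boxplot(), mapping the grouping variable to x and the numerical variable to y. The data frame below is a made-up stand-in (`toy` and its values are assumptions for illustration); with the real data you would pass nc instead:

```r
library(ggplot2)

# Hypothetical stand-in for nc; values are invented for illustration only.
toy <- data.frame(
  habit  = c("nonsmoker", "nonsmoker", "nonsmoker", "smoker", "smoker", "smoker"),
  weight = c(7.6, 8.1, 7.0, 6.9, 6.4, 7.3)
)

# One box per level of habit, with weight on the vertical axis.
p <- ggplot(data = toy, mapping = aes(x = habit, y = weight)) +
  geom_boxplot()
p
```

If habit contains missing values, they appear as a separate NA box unless those rows are filtered out first.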

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions.

Calculate the mean weight and standard deviation for each group of the habit variable. Remove any missing values for habit.
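
One possible approach is to combine filter(), group_by(), and summarise(). The sketch below uses a made-up data frame (`toy`, `by_habit`, and the values are assumptions for illustration, not results from nc):

```r
library(dplyr)

# Hypothetical stand-in for nc; values are invented for illustration only.
toy <- tibble(
  habit  = c("nonsmoker", "nonsmoker", "nonsmoker", "smoker", "smoker", NA),
  weight = c(7.5, 8.0, 7.0, 6.5, 7.5, 7.2)
)

by_habit <- toy %>%
  filter(!is.na(habit)) %>%    # drop rows with unknown smoking status
  group_by(habit) %>%
  summarise(
    mean_weight = mean(weight),
    sd_weight   = sd(weight),
    n           = n()          # group size, useful when checking conditions
  )
by_habit
```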

There is an observed difference, but is this difference statistically significant? That is, could a difference this large plausibly arise by chance when habit and weight are independent of each other, or can we infer that habit and weight are dependent? To answer this question we will conduct a hypothesis test.

Inference

Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes by adding a new variable, defined as n(), to the summarize command you used above.
Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.
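
One way to write a two-sided pair of hypotheses, taking \(\mu_{\rm ns}\) and \(\mu_{\rm s}\) to be the population mean birth weights for babies of non-smoking and smoking mothers (this notation is an assumption borrowed from the exercises later in this lab):

\[
H_0: \mu_{\rm ns} - \mu_{\rm s} = 0
\qquad \text{versus} \qquad
H_A: \mu_{\rm ns} - \mu_{\rm s} \neq 0
\]

The null value of 0 here corresponds to the null argument of the inference function, and the \(\neq\) sign corresponds to alternative = "twosided".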

Next, we introduce a new function, inference, that we will use for conducting hypothesis tests and constructing confidence intervals.

inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")

Let’s pause for a moment to go through the arguments of this custom function.

y is the response variable that we are interested in: weight.
x is the explanatory variable, which is the variable that splits the data into two groups, smokers and non-smokers: habit.
data is the data frame in which these variables are stored.
statistic is the sample statistic we’re using or, equivalently, the population parameter we’re estimating: here, the mean.
type sets the type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci").
When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other.
The alternative hypothesis can be "less", "greater", or "twosided".
Lastly, the method of inference can be "theoretical" or "simulation" based.

For more information on the inference function see the help file with ?inference.
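
As a cross-check on the theoretical method, a similar two-sample test can be sketched with base R's t.test(). This is not the inference function itself: t.test() uses Welch's approximation for the degrees of freedom by default, so its p-value may not match the inference output exactly. The data frame is a made-up stand-in (`toy` and its values are assumptions for illustration):

```r
# Hypothetical stand-in for nc; values are invented for illustration only.
toy <- data.frame(
  habit  = rep(c("nonsmoker", "smoker"), each = 5),
  weight = c(7.6, 8.1, 7.0, 7.9, 7.4, 6.9, 6.4, 7.3, 6.8, 7.1)
)

# Two-sided test of H0: the difference in population means is 0.
res <- t.test(weight ~ habit, data = toy, mu = 0, alternative = "two.sided")

res$estimate   # the two group means
res$statistic  # the t statistic
res$p.value    # the two-sided p-value
```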

Using the above output from the inference function, what is an appropriate point estimate for the difference in the population means \(\mu_{\rm ns} - \mu_{\rm s}\)?
Using the above output from the inference function, calculate the standard error of the estimate.
Change the type argument to "ci" to construct and record a confidence interval for the difference between the weights of babies born to nonsmoking and smoking mothers, and interpret this interval in the context of the data. Note that by default you’ll get a 95% confidence interval. If you want to change the confidence level, add a new argument (conf_level), which takes a value between 0 and 1. Also note that when constructing a confidence interval, arguments like null and alternative are not useful, so make sure to remove them.
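
The quantities in these exercises can also be computed by hand, which helps when checking the inference output. A minimal sketch on made-up data (`toy`, the variable names, and the values are assumptions for illustration). The degrees of freedom use the conservative choice min(n1, n2) - 1, which is an assumption and may differ from what inference uses:

```r
# Hypothetical stand-in for nc; values are invented for illustration only.
toy <- data.frame(
  habit  = rep(c("nonsmoker", "smoker"), each = 5),
  weight = c(7.6, 8.1, 7.0, 7.9, 7.4, 6.9, 6.4, 7.3, 6.8, 7.1)
)

ns <- toy$weight[toy$habit == "nonsmoker"]
s  <- toy$weight[toy$habit == "smoker"]

# Point estimate: difference in sample means (nonsmoker minus smoker).
point_est <- mean(ns) - mean(s)

# Standard error of the difference between two independent sample means.
se <- sqrt(var(ns) / length(ns) + var(s) / length(s))

# 95% confidence interval: point estimate +/- critical t value * SE.
df     <- min(length(ns), length(s)) - 1   # conservative choice (assumption)
t_star <- qt(0.975, df)
ci     <- point_est + c(-1, 1) * t_star * se

point_est
se
ci
```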

By default the function reports an interval for \(\mu_{\rm nonsmoker} - \mu_{\rm smoker}\). We can easily change this order by using the order argument:

inference(y = weight, x = habit, data = nc, statistic = "mean", type = "ci", 
          method = "theoretical", order = c("smoker","nonsmoker"))

This is a modified version of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. The unmodified version of this lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.

Inference for numerical data - Part 1