Infer: A Tidyverse Approach to Statistical Inference in R

The infer package in R provides a tidyverse-friendly framework for performing statistical inference. Its workflow is built from four verbs, specify(), hypothesize(), generate(), and calculate(), that spell out the steps involved in hypothesis testing and confidence interval estimation. This makes it easier to understand and communicate the logic behind your statistical analyses.

Here’s a tutorial covering the core functionalities of infer:

1. Installation and Loading

First, install and load the infer package along with dplyr and ggplot2, which this tutorial uses for data manipulation and plotting (installation is only needed once):

install.packages("infer")
install.packages("dplyr")
install.packages("ggplot2")

library(infer)
library(dplyr)
library(ggplot2)

2. Data Preparation

We’ll use the mtcars dataset for demonstration.

data(mtcars)
head(mtcars)

3. Hypothesis Testing

Let’s say we want to test whether the mean miles per gallon (mpg) is significantly different from 20.

a. Specify the Hypothesis

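Because infer builds the null distribution by resampling, what follows is a simulation-based analogue of a one-sample t-test rather than the t-test itself. Our null hypothesis (H0) is that the population mean mpg is 20. Our alternative hypothesis (HA) is that the population mean mpg is not 20.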

b. Calculate the Observed Statistic

We calculate the observed mean mpg from our sample.

observed_mean <- mtcars %>%
  summarize(mean_mpg = mean(mpg)) %>%
  pull(mean_mpg)

observed_mean

c. Generate the Null Distribution

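We use infer to generate the null distribution by simulating data under the null hypothesis. For a point null on a mean, generate() with type = "bootstrap" resamples the data after shifting it so that its mean equals mu. Resampling is random, so call set.seed() beforehand if you need reproducible results.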

null_distribution <- mtcars %>%
  specify(response = mpg) %>%
  hypothesize(null = "point", mu = 20) %>%
  generate(replicates = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
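
To make the point-null bootstrap more concrete, here is a rough base-R sketch of the same idea: shift the sample so its mean equals the hypothesized value, then resample with replacement. The names mu_0, shifted_mpg, and manual_null_means are illustrative only, and the sketch mirrors the idea rather than reproducing infer's internals.

```r
# Sketch: build a null distribution for H0: mu = 20 by hand (base R only).
set.seed(123)  # arbitrary seed, used only to make the sketch reproducible

mu_0 <- 20  # hypothesized mean, taken from the null hypothesis above

# Shift the sample so that its mean equals the hypothesized mean
shifted_mpg <- mtcars$mpg - mean(mtcars$mpg) + mu_0

# Resample the shifted data with replacement and record each resample's mean
manual_null_means <- replicate(
  1000,
  mean(sample(shifted_mpg, size = length(shifted_mpg), replace = TRUE))
)

# The simulated means should be centered near mu_0
mean(manual_null_means)
```

The shift is what encodes the null hypothesis: the spread of the data is kept, but its center is moved to 20.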

d. Visualize the Null Distribution and Observed Statistic

ggplot(null_distribution, aes(x = stat)) +
  geom_histogram(binwidth = 0.25) +  # the simulated means span only a few mpg, so use narrow bins
  geom_vline(xintercept = observed_mean, color = "red")
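
As an alternative to building the plot by hand, infer ships its own plotting helpers. The sketch below assumes a reasonably recent infer version in which visualize() and shade_p_value() are combined with +; the seed and the name null_plot are arbitrary choices for illustration.

```r
library(infer)
library(dplyr)

set.seed(123)  # arbitrary seed for reproducibility

# Rebuild the null distribution so that this sketch runs on its own
null_distribution <- mtcars %>%
  specify(response = mpg) %>%
  hypothesize(null = "point", mu = 20) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Histogram of the null distribution, with the region at least as extreme
# as the observed mean shaded in both tails
null_plot <- visualize(null_distribution) +
  shade_p_value(obs_stat = mean(mtcars$mpg), direction = "two-sided")

null_plot
```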

e. Calculate the P-value

The p-value is the probability of observing a statistic as extreme as or more extreme than the observed statistic, assuming the null hypothesis is true.

p_value <- null_distribution %>%
  get_p_value(obs_stat = observed_mean, direction = "both")

p_value
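
To show what a two-sided simulation-based p-value amounts to, here is a small base-R sketch. It uses one common convention (twice the smaller tail proportion, capped at 1); treat that as an assumption about the calculation rather than a statement of infer's exact implementation. The names sim_means and manual_p_value are illustrative.

```r
# Sketch: two-sided p-value from a simulated null distribution (base R only).
set.seed(123)  # arbitrary seed for reproducibility

obs_mean <- mean(mtcars$mpg)

# Null distribution for H0: mu = 20, built by shifting and resampling
shifted_mpg <- mtcars$mpg - obs_mean + 20
sim_means <- replicate(1000, mean(sample(shifted_mpg, replace = TRUE)))

# Proportion of simulated means at or beyond the observed mean in each tail
p_left  <- mean(sim_means <= obs_mean)
p_right <- mean(sim_means >= obs_mean)

# Double the smaller tail and cap the result at 1
manual_p_value <- min(1, 2 * min(p_left, p_right))
manual_p_value
```

Because the observed mean lies close to 20, most simulated means are at least as far from 20 as it is, so the p-value should be large.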

4. Confidence Intervals

Let’s calculate a 95% confidence interval for the mean mpg.

a. Generate the Bootstrap Distribution

We generate a bootstrap distribution to estimate the variability of the sample mean.

bootstrap_distribution <- mtcars %>%
  specify(response = mpg) %>%
  generate(replicates = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

b. Calculate the Confidence Interval

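We use get_confidence_interval() to calculate the confidence interval. By default it uses the percentile method, which for a 95% interval takes the 2.5th and 97.5th percentiles of the bootstrap distribution.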

confidence_interval <- bootstrap_distribution %>%
  get_confidence_interval(level = 0.95)

confidence_interval
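
For comparison, a percentile interval can be sketched in a few lines of base R. The names boot_means and manual_ci are illustrative, and the exact endpoints will differ slightly from infer's output because the resamples differ.

```r
# Sketch: 95% percentile bootstrap interval for the mean mpg (base R only).
set.seed(123)  # arbitrary seed for reproducibility

# Resample the original (unshifted) data and record each resample's mean
boot_means <- replicate(1000, mean(sample(mtcars$mpg, replace = TRUE)))

# The middle 95% of the bootstrap means forms the interval
manual_ci <- quantile(boot_means, probs = c(0.025, 0.975))
manual_ci
```

Note that no shift is applied here: a confidence interval describes the data as observed, so there is no null hypothesis to impose.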

5. Comparing Two Groups

Let’s compare the mean mpg for cars with automatic (am = 0) and manual (am = 1) transmissions.

a. Specify the Hypothesis

H0: The mean mpg is the same for both transmission types. HA: The mean mpg is different for the two transmission types.

b. Calculate the Observed Statistic

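observed_diff <- mtcars %>%
  mutate(am = factor(am)) %>%  # "diff in means" needs a categorical explanatory variable
  specify(mpg ~ am) %>%
  calculate(stat = "diff in means", order = c("1", "0"))  # manual minus automatic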

observed_diff

c. Generate the Null Distribution

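We use permutation to generate the null distribution: under the null hypothesis of independence, the transmission labels are shuffled across cars and the difference in means is recomputed for each shuffle.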

null_distribution_diff <- mtcars %>%
  mutate(am = factor(am)) %>%  # treat transmission type as categorical
  specify(mpg ~ am) %>%
  hypothesize(null = "independence") %>%
  generate(replicates = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("1", "0"))

d. Calculate the P-value

p_value_diff <- null_distribution_diff %>%
  get_p_value(obs_stat = observed_diff, direction = "both")

p_value_diff
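
The permutation test can also be written out by hand, which may help show what the pipeline is doing. This base-R sketch uses a hypothetical helper, group_diff(), and the same two-sided convention of doubling the smaller tail proportion; it is an illustration of the technique, not infer's implementation.

```r
# Sketch: permutation test for a difference in mean mpg by transmission (base R only).
set.seed(123)  # arbitrary seed for reproducibility

# Hypothetical helper: mean mpg of manual cars (am == 1) minus automatic cars (am == 0)
group_diff <- function(mpg, am) {
  mean(mpg[am == 1]) - mean(mpg[am == 0])
}

obs_diff <- group_diff(mtcars$mpg, mtcars$am)

# Shuffle the transmission labels and recompute the difference each time
perm_diffs <- replicate(1000, group_diff(mtcars$mpg, sample(mtcars$am)))

# Two-sided p-value: double the smaller tail proportion, capped at 1
manual_p_value_diff <- min(
  1,
  2 * min(mean(perm_diffs <= obs_diff), mean(perm_diffs >= obs_diff))
)
manual_p_value_diff
```

Shuffling only the labels keeps the mpg values and the group sizes fixed, so any difference that remains is due to chance alone.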

e. Confidence Interval for the Difference

bootstrap_diff <- mtcars %>%
  mutate(am = factor(am)) %>%  # treat transmission type as categorical
  specify(mpg ~ am) %>%
  generate(replicates = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("1", "0"))

conf_int_diff <- bootstrap_diff %>%
  get_confidence_interval(level = 0.95)

conf_int_diff

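Key Advantages of infer:

Each analysis is written as a pipeline of verbs, so the hypothesis, the resampling scheme, and the statistic are all stated explicitly in the code. The same pattern carried over from the one-sample test to the two-group comparison, with only the formula, the null hypothesis, and the statistic changing.

This tutorial provides a basic introduction to infer. Explore the package documentation for more advanced features and examples.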