Two-Sample Hypothesis Testing

Introduction

Classical hypothesis testing is concerned with testing two statements, the null, and alternative hypothesis. The null hypothesis is the statement that is believed to be true while the alternative hypothesis is a statement against the null that is often desired to be true.

In this example, we will develop a hypothesis concerning the salaries of professors in a particular university designated by discipline. Discipline is categorized by the ‘theoretical’ departments and the ‘applied’ departments of the university. The hypothesis to test is there is no significant difference in salary between the ‘theoretical’ and ‘applied’ departments of the university. This scenario would be one case where one would likely want the null hypothesis to be true since higher salaries among different but similar groups would be considered unfair. We have no reason to assume there is a difference in salaries between the departments, so we use that as the null.

\(H_0\) denotes the null hypothesis, and the alternative hypothesis is designated \(H_A\). We can write the above hypothesis regarding the null and alternative as:

\[ H_0: \mu_1 = \mu_2 \] \[ H_A: \mu_1 \neq \mu_2 \]

Where \(\mu_1\) is the mean of the theoretical departments, and \(\mu_2\) is the mean of the applied departments.

The goal of hypothesis testing is to provide enough evidence to reject the null for the alternative. This is done by controlling for the probability of two types of error occurring, named Type I and Type II Error. Type I error is the probability of rejecting the null hypothesis as false when it is indeed true. Type II Error is thus the probability of accepting the null hypothesis as true when it is indeed false.

When testing a hypothesis, one of the primary concerns is keeping Type I Error at a certain level, typically set at 0.05.

The rejection or acceptance of the null hypothesis is determined by the p-value, which is the probability of achieving the same result assuming the null hypothesis is true. The p-value is usually 0.05, or 5%. Here’s a great explanation of the p-value if you want to learn more.

Testing the Hypothesis

To test our hypothesis, suppose we collect sample data on the salaries of the professors in the departments categorized as theoretical and applied. To this effect, we use the Salaries dataset in the cars package. We can also load the other packages we’ll use in this example.

library(ggplot2)
library(car)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(reshape2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

data(Salaries)

The Salaries dataset is loaded using the data() command. The dataset contains six variables with various information including professors’ years of service, gender, and rank.

The levels ‘A’ and ‘B’ denote the department disciplines. Let’s go ahead and change these to the actual discipline names to make it easier to distinguish the groups.

Salaries$discipline <- ifelse(Salaries$discipline == 'A', 'Theoretical', 'Applied')

In this example, we are only interested in the discipline and salary variables, so we remove the unneeded variables.

sal <- Salaries[,c(2,6)]

We can also split the data into two for each group. Splitting the data can be done using the filter() function from the dplyr package.

theo <- filter(sal, discipline == 'Theoretical') 
appl <- filter(sal, discipline == 'Applied')

Assumptions of the t test

The two populations are assumed to be normally distributed with mean \(u\) and variance \(\sigma^2\)
Data is assumed to be sampled independently from the two populations being compared.

Another assumption of the Student’s t test is the populations have the same variance. In this example, we will use Welch’s t test, which is more reliable when the populations have unequal variance and the sample sizes are different.

Testing for Normality

There are multiple ways to test that the data is normally distributed. One quick way is to plot a histogram and examine the shape. If it is approximately bell-shaped, that is strong indication the data is normal.

Plot a density plot of the two samples using the package ggplot.

ggplot(Salaries, aes(Salaries$salary)) + 
  geom_density(aes(data = Salaries$salary, fill = Salaries$discipline), position = 'identity', alpha = 0.5) +
  labs(x = 'Salary', y = 'Density') + scale_fill_discrete(name = 'Discipline') + scale_x_continuous(limits = c(0, 300000))

The data exhibits an approximately normally distributed bell shape, though it does appear to be somewhat positively skewed. A Q-Q plot can also be plotted to check normality.

par(mfrow=c(1,2))
qqPlot(theo$salary, main = 'Theoretical Departments')
qqPlot(appl$salary, main = 'Applied Departments')

The data mostly falls on the straight line which is another good indicator of normality. It’s important to note we haven’t proved the data is normally distributed, rather, through the histogram and Q-Q plots, that the data is reasonably consistent with normality.

Welch’s t test

Welch’s t-test is an adaption of Student’s t test and is more performant when the sample variances and size are unequal. The test still depends on the assumption of the underlying population distributions being normally distributed.

Welch’s t test is defined as:

\[ t = \frac{\bar{X_1} - \bar{X_2}}{\sqrt{\frac{s_{1}^{2}}{N_1} + \frac{s_{2}^{2}}{N_2}}} \]

where:

\(\bar{X}\) is the sample mean, \(s^2\) is the sample variance, \(n\) is the sample size

Therefore, we need to find the above for the two samples.

With the data collected, we can perform the test. Since the groups are independent, we can use the unpaired two-sample t-test. Before conducting the test outright, it is important we confirm several assumptions.We can now calculate the statistics to use Welch’s t-test. We’ll start by finding the mean, which is defined as:

\[ \large \bar{X} = \sum_{i=1}^{n}X_i \]

Or, the sum of all data points in the sample divided by the sample size. The mean can be found in R using the built-in method mean() or writing the equation:

sum(theo$salary) / length(theo$salary)

## [1] 108548.4

mean(theo$salary)

## [1] 108548.4

The variance of the sample is the sum of all points squared, subtracted by the squared sum of all data points, divided by the sample size and then divided again by the sample size minus one.

\[ \large \sigma^2 = \frac{\sum_{i=1}^{N}x_i^2 - (\sum_{i=1}^{N}x_i)^2/N}{n - 1} \]

Again, we can calculate the variance by using the built-in R method var() or writing it out:

(sum(theo$salary^2) - (sum(theo$salary)^2)/length(theo$salary)) / (length(theo$salary) - 1)

## [1] 932578340

var(theo$salary)

## [1] 932578340

Save all the necessary statistics. We calculate standard deviation here as it is the square root of the variance.

theo_mean <- mean(theo$salary)
theo_sd <- sd(theo$salary)
appl_mean <- mean(appl$salary)
appl_sd <- sd(appl$salary)

We now have everything we need to perform the test.

Performing the t-test

R’s built-in t.test() function uses Welch’s version automatically in this case.

welch.test <- t.test(appl$salary, theo$salary)
welch.test

## 
##  Welch Two Sample t-test
## 
## data:  appl$salary and theo$salary
## t = 3.1306, df = 377.83, p-value = 0.00188
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3525.978 15434.549
## sample estimates:
## mean of x mean of y 
##  118028.7  108548.4

From the output of the test we can see the p-value is equal to 0.00188, which is much less than 0.05. Therefore, we reject \(H_0\) and conclude there is a difference between the mean salary of the theoretical and applied departments of the university.

Manually Performing the t-test

We could also do the above t-test manually by calculating the t statistic, finding the degrees of freedom, and then use those values to find the rejection region and p-value.

To calculate the t statistic, we code Welch’s test:

t.val <- (appl_mean - theo_mean) / sqrt(appl_sd^2/length(appl$salary) + theo_sd^2/length(theo$salary))
t.val

## [1] 3.130634

We also need the degrees of freedom \(v\) which is approximated from the Welch-Satterthwaite equation:

\[\large v \approx \frac{\left(\frac{s_{1}^2}{N_1} + \frac{s_{2}^2}{N_2}\right)^2}{\frac{\left(\frac{s_1^2}{N_1^{2}}\right)^2}{v_1} + \frac{\left(\frac{s_2^2}{N_2^{2}}\right)^2}{v_2}}\]

Where \(v_1 = N_1 - 1\) degrees of freedom and \(v_2 = N_2 - 1\) degrees of freedom. The calculation can become rather unyieldy quickly, so numerator and denominator of the equation are separated to make it easier to keep it straight.

v_numer <- (theo_sd^2/length(theo$salary) + appl_sd^2/length(appl$salary))^2
v_denom <- ((theo_sd^2/length(theo$salary))^2 / (length(theo$salary)-1) + (appl_sd^2/length(appl$salary))^2 / (length(appl$salary)-1))

v.dof <- v_numer / v_denom

Then we can see the t statistic and degrees of freedom we calculated are equal to the output as given by the t.test() method.

welch.test$statistic

##        t 
## 3.130634

t.val

## [1] 3.130634

welch.test$parameter

##       df 
## 377.8305

v.dof

## [1] 377.8305

To calculate the p-value, we find the probability that the critical value of t is greater than the absolute value of the t statistic we computed earlier, 3.1306337. This is known as the rejection region and can be written as:

\[ \large \left|t_{obs}\right| > t_{\alpha/2, n_1+n_2-2} \]

The p-value is then found by plotting the Student’s t distribution and finding the where the t statistic lands on the plot with the given degrees of freedom. We can write this formally as:

\[ 2P(\left|t_{obs}\right|) \]

The probability is multiplied by two as the test is two-tailed. We can calculate the p-value we obtained using the pt() command by the following, which is equal to the p-value as reported by the t.test() function.

(1-pt(t.val, v.dof))*2

## [1] 0.001880022

welch.test$p.value

## [1] 0.001880022

Conclusion

In this example, a t-test was performed using Welch’s alternative formulation built into R as well as manually calculated to ensure the results. As seen from the above, R performs Welch’s variant when the variances or sample sizes are unequal. It’s recommended to use Welch’s t-test over Student’s as the former gives nearly the same results when sample variances and sizes are equal.

In coming examples, we will explore how to calculate confidence intervals, which are often a more efficient method of testing a hypothesis rather than the standard hypothesis testing examined in this post, as well as nonparametric methods and approaches when there are more than two samples.