Introduction

Background

t-Tests are a great way of identifying if two group means are statistically different. This can be done by comparing a sample to the population (one-sample) or comparing two different samples (two-sample). This tutorial will focus on the latter.

Applications

In the military, a common application of t-tests is in Test and Evaluation (T&E). Acquisition programs are being tested against requirements, and t-tests serve as a way to determine if the program meets or exceeds the standard. Especially, in Operational T&E, when sample sizes are small due to cost, t-tests are a valuable tool. Another example of the application within T&E is with upgrades to current programs. T-tests can be used to determine if the new system performs at least as well as the legacy system.

Packages

The functions utilized in this tutorial come from the built-in stats package. However, dplyr will need to be loaded for one example.

library(dplyr)

Functions

We can use built-in R functions to perform t-tests. There are multiple functions to perform t-tests in R, but for this tutorial we will look at two of them: t.test and pairwise.t.test.

Data

We will leverage built-in R data from sleep and airquality (R Core Team 2017). Snippets of each of these data sets are shown below.

head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Two-Sample t-Tests

As previously stated, two-sample t-tests are used when you want to compare two different samples. In order to do this in R, you can use t.test or pairwise.t.test. This section will focus on t.test.

Unpaired t-Test

The unpaired t-test is used when you want to compare two independent groups.

Suppose we want to test if the average increased sleep with two different drugs are equal, using the sleep dataset and t.test function.

t.test(extra~group, data = sleep)
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

As you can see, t.test outputs all of the information you need for a two-sample t-test. The p-value, alternative hypothesis (H1), confidence interval around the true difference in means, and the sample averages of each group, just to name a few. Notice that the p-value is > 0.05, implying that there does not exist enough evidence to reject the null; that is, the two groups are not statistically different.

You can also save the result of t.test and use the $ operator to extract the results, like so:

sleep_unpaired <- t.test(extra~group, data = sleep)
sleep_unpaired$p.value
## [1] 0.07939414

Paired t-Test

Use paired t-tests when obervations from one group are paired with the other. This can be done easily in R, by simply adding paired = TRUE when calling t.test and/or pairwise.t.test.

t.test(extra~group, data = sleep, paired = TRUE)
## 
##  Paired t-test
## 
## data:  extra by group
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4598858 -0.7001142
## sample estimates:
## mean of the differences 
##                   -1.58

If you compare the results of the paired test with the unpaired, you may notice that the p-values are different: 0.079 for the unpaired test and 0.003 for the paired. Why is this?

Even though the two methods are comparing group means, they are looking at the data in different ways. Just look at how the test statistics are computed. For unpaired tests, the test statistic is calculated as follows:

\[ t_{unpaired} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2(\frac{1}{n_1} + \frac{1}{n_2})}} \] Where \(\bar{x}_i\) is the mean of group i, \(n_i\) is the sample size of group i, and \(s^2\) is the pooled variance.

For paired tests, the test statistic is calculated as follows:

\[ t_{paired} = \frac{\bar{d}}{\frac{s_d}{\sqrt{n}}} \] where \(\bar{d}\) is the mean difference between observations (\(y_i - x_i\)), \(s_d\) is the standard deviation of the differences, and \(n\) is the sample size.

The unpaired test compares group means idependently. However, when utilizing paired t-tests, the observations are not independent and so tests the mean difference to determine if there is a signficant difference between treatments.

Other t-Test Adjustments

This tutorial has demonstrated how to do unpaired and paired t-tests. There are other adjustments we can make to these functions and this section will explain a few of the available options. The following function adjustments work for both t.test and pairwise.t.test, but this section will only demonstrate the examples using t.test.

Alternative Hypothesis

You can adjust the alternative hypothesis easily when calling the function. The default alternative is two-sided, which means your null hypothesis is \(H_0 = 0\) and the alternative is \(H_A \neq 0\). For one-sided tests,

set alternative = "less" if \(H_0 \ge 0\) and \(H_A < 0\), and
      alternative = "greater" if \(H_0 \le 0\) and \(H_A > 0\) (R Core Team 2017).

Suppose we want to test if the difference in average increased sleep of the two groups is \(\le 0\), using the sleep dataset:

t.test(extra~group, data = sleep, alternative = "greater") 
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.9603
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -3.053381       Inf
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

Notice that the p-value is > 0.05, implying that there does not exist enough evidence to reject the null; that is, the difference of the means is \(\le\) 0.

Defining \(\mu\)

You can also adjust the true value of the mean difference being tested by setting mu = N, where N is the true value of the mean difference (R Core Team 2017).

Suppose we want to test if the difference in average increased sleep of the two groups is \(\ge 1\), using the sleep dataset:

t.test(extra~group, data = sleep, alternative = "less", mu = 1) 
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -3.0385, df = 17.776, p-value = 0.003571
## alternative hypothesis: true difference in means is less than 1
## 95 percent confidence interval:
##        -Inf -0.1066185
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

Notice that the p-value is < 0.05, implying that there does exist enough evidence to reject the null; that is, the difference of the means is < 1.

Equal Variances

The default for both t.test and pairwise.t.test is to assume unequal variances. However, if you would like to test with the assumption that the two variances are equal, you can add var.equal = TRUE when you call the function. (R Core Team 2017)

Suppose we want to test if the difference in average increased sleep of the two groups are equal, assuming the variances of the two groups are equal and using the sleep dataset:

t.test(extra~group, data = sleep, var.eq = TRUE) 
## 
##  Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 18, p-value = 0.07919
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.363874  0.203874
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

Note that the p-value is > 0.5, therefore there is not enough evidence to show that the means of the two groups are not equal.

Confidence Level

The default confidence level of the interval is 0.95 (95%). You can adjust the confidence level that you are interested in by adding conf.level = a, where 0 < a < 1. (R Core Team 2017)

Suppose we want to test if the difference in average increased sleep of the two groups are equal at an 80% confidence level, using the sleep dataset:

t.test(extra~group, data = sleep, conf.level = .8) 
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 80 percent confidence interval:
##  -2.7101645 -0.4498355
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

Based on the p-value, we again fail to reject the null hypothesis. That is, we are 80% confident that the means of the two groups are not statistically significantly different.

Multiple Pairwise Comparisons

In the previous section, we looked solely at applications of t.test. This section will focus on the advantages of using pairwise.t.test. Unlike t.test, pairwise.t.test only provides the p-value for the comparison. The t.test function can only test two samples at a time. However, pairwise.t.test can perform multiple pairwise comparisons and output a triangular matrix of the p-values between all groups.

In addition, pairwise.t.test will adjust the p-values using the “holm” method by default (R Core Team 2017). You can set p.adjust = to any of the following options:

p.adjust.methods
## [1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
## [6] "BY"         "fdr"        "none"

Using any of the above methods, besides “none.” will allow you to maintain the overall p.value across multiple comparisons. To show how pairwise.t.test works with multiple groups, we will use the airquality dataset to show this.

Suppose we want to test if there is a difference of mean Ozone between months.

attach(airquality)
Month <- factor(Month, labels = month.abb[5:9])
pairwise.t.test(Ozone, Month)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  Ozone and Month 
## 
##     May     Jun     Jul     Aug    
## Jun 1.00000 -       -       -      
## Jul 0.00026 0.05113 -       -      
## Aug 0.00019 0.04987 1.00000 -      
## Sep 1.00000 1.00000 0.00488 0.00388
## 
## P value adjustment method: holm

As you can see, it outputs a lower triangular matrix of all of the pairwise comparisons. You can use this to determine which differences are statistially significantly different, and then look at their means to determine which factor is superior. For instance, the p-value for the difference between Ozone for May and July is very small, indicating that there is a statistically significant difference. We can determine the mean Ozone for each month, and compare them

(MayAvg<- airquality %>%
          filter(Month == 5) %>%
          select(Ozone) %>%
          as.matrix() %>%
          mean(na.rm = T))
## [1] 23.61538
(JulyAvg<-airquality %>%
          filter(Month == 7) %>%
          select(Ozone) %>%
          as.matrix() %>%
          mean(na.rm = T))
## [1] 59.11538

In July, the mean ozone in parts per billion is more than twice that in May.

Conclusion

You can easily do two-sample t-tests with built-in R functions. Throughout this tutorial you have learned how to do unpaired, paired, and multiple pairwise t-tests. In addition, you have learned how to change the default settings in order to customize the t-tests to get the results you need.

Test your knowledge with the exercises below!

Exercises

For the following exercises, load the npk dataset. This is a three-factor fractional factorial experiment conducted on 6 blocks. The three factors are N (nitrogen), P (phosphate), and K (potassium). If N, P, or K equals 1, then that element was applied at that datapoint; 0 indicates that it was not applied. The response is labeled Yield and indicates the yield of peas in pounds/plot. (R Core Team 2017)

head(npk)
##   block N P K yield
## 1     1 0 1 1  49.5
## 2     1 1 1 0  62.8
## 3     1 0 0 0  46.8
## 4     1 1 0 1  57.0
## 5     2 1 0 0  59.8
## 6     2 1 1 1  58.5
attach(npk)

Exercise 1

Is there a difference in average yield between groups with Nitrogen and groups without? Suppose, you want to detect the difference with 90% confidence.

t.test(yield~N, conf.level = .9)
## 
##  Welch Two Sample t-test
## 
## data:  yield by N
## t = -2.4618, df = 21.88, p-value = 0.02218
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
##  -9.535247 -1.698086
## sample estimates:
## mean in group 0 mean in group 1 
##        52.06667        57.68333
t.test(yield~P, conf.level = .9)
## 
##  Welch Two Sample t-test
## 
## data:  yield by P
## t = 0.46147, df = 21.682, p-value = 0.6491
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
##  -3.222727  5.589394
## sample estimates:
## mean in group 0 mean in group 1 
##        55.46667        54.28333
t.test(yield~K, conf.level = .9)
## 
##  Welch Two Sample t-test
## 
## data:  yield by K
## t = 1.6374, df = 17.621, p-value = 0.1193
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
##  -0.239998  8.206665
## sample estimates:
## mean in group 0 mean in group 1 
##        56.86667        52.88333

Exercise 2

Which factor (or combination of factors) appears to affect pea yield the most? That is, which factor, or combination, yields the most peas in pounds/plot?

First, let’s find all pairwise comparisons using Bonferroni to ensure total confidence of 95%.

npk <- mutate(npk,fac_combined = 
                      ifelse(N == 1 & P == 0 & K == 0, "N" ,
                      ifelse(N == 0 & P == 1 & K == 0, "P",
                      ifelse(N == 0 & P == 0 & K == 1, "K",       
                      ifelse(N == 1 & P == 1 & K == 0, "NP",
                      ifelse(N == 0 & P == 1 & K == 1, "PK", 
                      ifelse(N == 1 & P == 0 & K == 1, "NK", "NPK")))))))

pairwise.t.test(npk$yield, npk$fac_combined, p.adj = "bonf" )
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  npk$yield and npk$fac_combined 
## 
##     K    N    NK   NP   NPK  P   
## N   0.36 -    -    -    -    -   
## NK  1.00 1.00 -    -    -    -   
## NP  1.00 1.00 1.00 -    -    -   
## NPK 1.00 0.25 1.00 1.00 -    -   
## P   1.00 1.00 1.00 1.00 1.00 -   
## PK  1.00 0.18 1.00 1.00 1.00 1.00
## 
## P value adjustment method: bonferroni

Because the group argument needs to be a vector, I created a new column based on the values of N, P, and K. Then, I calculated the pairwise comparisons using that new variable as the grouping. We notice that using Bonferroni, we do not detect any statistically significant differences. We want to determine which (if any) treatment is better overall, so we will need to look at the comparisons separately.

Based on these results, let’s see what the mean yield looks like across treatments.

attach(npk)
AvgYields <- group_by(npk, fac_combined) %>%
             summarize(mean(yield)) %>%
             arrange(desc(`mean(yield)`))
AvgYields
## # A tibble: 7 x 2
##   fac_combined `mean(yield)`
##          <chr>         <dbl>
## 1            N      63.76667
## 2           NP      57.93333
## 3           NK      54.66667
## 4            P      54.33333
## 5          NPK      52.90000
## 6            K      52.00000
## 7           PK      50.50000
boxplot(yield~fac_combined, 
        xlab =  "Treatment", 
        ylab = "Pea Yield in Pounds/Plot", 
        main = "Yield Comparison Across Treatments")

As we can see by the averages and the boxplots, N tends to stand out from the rest. Let’s see if there is a statistically significant difference between it and NP (the second largest average yield).

t.test(yield[fac_combined == "N"], yield[fac_combined == "NP"])
## 
##  Welch Two Sample t-test
## 
## data:  yield[fac_combined == "N"] and yield[fac_combined == "NP"]
## t = 1.3516, df = 3.9781, p-value = 0.2482
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.175176 17.841843
## sample estimates:
## mean of x mean of y 
##  63.76667  57.93333

We did not detect a difference, so let’s move down the list until we find a statistically significant difference in average yield.

t.test(yield[fac_combined == "N"], yield[fac_combined == "NK"])
## 
##  Welch Two Sample t-test
## 
## data:  yield[fac_combined == "N"] and yield[fac_combined == "NK"]
## t = 2.386, df = 3.8671, p-value = 0.07771
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.634076 19.834076
## sample estimates:
## mean of x mean of y 
##  63.76667  54.66667
t.test(yield[fac_combined == "N"], yield[fac_combined == "P"])
## 
##  Welch Two Sample t-test
## 
## data:  yield[fac_combined == "N"] and yield[fac_combined == "P"]
## t = 1.5274, df = 3.0762, p-value = 0.2219
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.949417 28.816083
## sample estimates:
## mean of x mean of y 
##  63.76667  54.33333
t.test(yield[fac_combined == "N"], yield[fac_combined == "NPK"])
## 
##  Welch Two Sample t-test
## 
## data:  yield[fac_combined == "N"] and yield[fac_combined == "NPK"]
## t = 3.1197, df = 3.7148, p-value = 0.0393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.8956038 20.8377296
## sample estimates:
## mean of x mean of y 
##  63.76667  52.90000

Continuing, we can see that the first treatment to be statistically significantly different from treatment N is NPK. This would imply that N is the biggest contributor to increase in pea yield, but the interaction between all three reduces the performance signicantly.

References

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.