paired t-test

A paired t-test is a parametric test that takes the difference within each pair of observations and tells you whether the mean of those differences is significantly different from zero. A paired t-test has only one sampling variability assumption that has to be met:

  1.    The residuals have to be normally distributed.

The residuals are the leftover variation that is not accounted for by your explanatory variable. You can use an analysis of the residuals to check this assumption. Before we can interpret our statistical output, we need to make sure the assumption above is met.

Background Theory

A paired t-test is used when your sampling design is paired and you want to compare two categories of your explanatory variable to each other. It is easier to explain how a paired t-test works with an example. If you wanted to compare the abundance of barnacles at high and low tidal heights, you could set up a paired design whereby you measure barnacle abundance at “high” and “low” heights at various locations. The paired t-test would calculate the difference between your “high” and “low” heights at each location, take the mean of those differences, and see if that mean is significantly different from zero. This is more powerful than using a two-sample t-test because the abundance of barnacles might vary with location, resulting in a lot of variance around the “high” mean and the “low” mean. If that variance is high, a two-sample t-test is less likely to detect a significant difference between the two groups, even when one group is consistently higher at every location.
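To make the logic above concrete, here is a minimal sketch using made-up barnacle counts (the numbers and the names high and low are hypothetical, not real data). It shows that a paired t-test gives the same answer as a one-sample t-test on the per-location differences, and that a two-sample t-test on the same numbers ignores the pairing.

```r
# Hypothetical barnacle counts at six locations (made-up numbers)
high <- c(12, 30, 8, 45, 22, 17)
low  <- c(18, 39, 15, 52, 30, 21)

# Difference between "low" and "high" at each location
d <- low - high

# Paired t-test on the two columns
paired_res <- t.test(low, high, paired = TRUE)

# One-sample t-test asking whether the mean difference is zero
onesample_res <- t.test(d, mu = 0)

# Two-sample t-test that treats the columns as unrelated groups
twosample_res <- t.test(low, high, var.equal = TRUE)

# Compare the p-values from the three approaches
paired_res$p.value
onesample_res$p.value
twosample_res$p.value
```

In this made-up example the counts vary a lot among locations, but "low" is higher than "high" at every location, which is the situation where pairing helps most.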

Information about the Dataset

For this test, we will be using a dataset collected by a student during her Puget Sound summer research experience. Her research examined the influence of a highly abundant moss species called step moss (Hylocomium splendens, Figure 1) on soil moisture content (among other things). She collected soil samples from the forest floor under step moss and not under step moss in a paired design. She hypothesized that step moss would decrease soil moisture during the dry season by intercepting incoming precipitation or moisture.

Figure 1. Step moss (Hylocomium splendens) is very abundant on the forest floor in the Hoh rainforest, WA.

Loading the Data into RStudio

To keep your RStudio script file organized, you will want to include some information at the top of the file. You can use the hashtag (#) to add notes to your RStudio file; R ignores everything after a # on a line when it runs your code.

For each test, you should include the following lines at the top of each RStudio file:

#question:
#response variable:
#explanatory variable:
#test name:

Below is what it would look like for the step moss example:

#question: how does step moss presence influence soil moisture?
#response variable: soil moisture (continuous)
#explanatory variable: step moss presence (categorical)
#test name: paired t-test

In order to run a paired t-test, you need to have your response variable for one of your groups in one column and the other group in another column so that they are in a paired design. The image below shows you how to set up your Excel file for a paired t-test (Figure 2).

Figure 2. For a paired t-test, you need to have your response variable for each of your groups on the same line.
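If your data were instead entered with one measurement per row, they can be rearranged into the paired layout in R. The sketch below is only an illustration: the data frame long, its column names (treatment, moisture), and its values are made up, and it uses the base R function reshape().

```r
# Hypothetical data entered with one row per measurement (made-up values)
long <- data.frame(
  plot_num  = c(1, 1, 2, 2, 3, 3),
  treatment = c("NOSM", "SM", "NOSM", "SM", "NOSM", "SM"),
  moisture  = c(95.2, 70.1, 110.4, 81.3, 88.0, 69.5)
)

# Rearrange so each plot is one row with one column per group
wide <- reshape(long, idvar = "plot_num", timevar = "treatment",
                direction = "wide")

# reshape() names the new columns moisture.NOSM and moisture.SM;
# strip the prefix so the names match the ones used on this page
names(wide) <- sub("moisture.", "", names(wide), fixed = TRUE)
wide
```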

To bring data into RStudio, you can use the line of code below that will allow you to select any file directly from your computer. This code will work even if you have empty cells without anything in the cells (R will convert them to NAs) or periods in place of empty cells.

#code to load a datafile into R from your computer
DATA <- read.csv(file.choose(),na.strings=".")
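To see how empty cells and periods are handled without needing a file, here is a small sketch that reads made-up data from a text string instead of from your computer. The object names csv_text and demo and the values are hypothetical.

```r
# Made-up file contents: plot 2 has an empty NOSM cell,
# and plot 3 has a period in place of the SM value
csv_text <- "plot_num,NOSM,SM
1,90.1,70.2
2,,65.0
3,88.4,."

# Same options as above, but reading from the text string
demo <- read.csv(text = csv_text, na.strings = ".")

# Show the data and count the missing values in each column
demo
colSums(is.na(demo))
```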

An alternative way to open your file is to use the path directly to the folder on your computer where you set your working directory. This will allow you to choose a named file that you saved in your working directory (to learn more about what your working directory is or how to save files, go to the main webpage and click on Introduction to RStudio). In the DATA code line, TO.BE.EDITED is the name of your file. (Mac users: remember to omit the “C:” in your code.)

#code to set the working directory folder on your computer
setwd('C:/Users/YourUserName/Documents/Rfiles')
#code to load a datafile into R from your working directory
DATA <- read.csv(file="TO.BE.EDITED.csv", header=TRUE, sep=",", na.strings=".")

You can download the SoilMoisture_paired.xlsx file from the home page, save it in your working directory as a .csv file and then you can use the code below to upload the SoilMoisture_paired.csv file into RStudio.

#code to load the datafile "SoilMoisture_paired" into R from your working directory
DATA <- data.frame(read.table(file='SoilMoisture_paired.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))

You can also use the following lines of code to make sure your data was read in correctly. The first line of code displays the names of the columns in your datafile, and the second line of code gives you the dimensions of your datafile.

#code to see the names of the column titles
names(DATA)

## [1] "plot_num" "NOSM"     "SM"

#code to see the dimensions (number of rows and columns) of your datafile
dim(DATA)

## [1] 20  3

If the file was successfully loaded into RStudio, you should have the names of the columns above. The plot_num is just the plot number, NOSM is the soil moisture content of soil collected not under step moss and SM is the soil moisture content of the soil collected under step moss. If you got an error, go back to the main stats webpage and click on Troubleshooting. Once you are sure the data is entered, you can proceed.
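A few other base R functions can help with this check. The sketch below uses a small stand-in data frame (DATA_demo, with made-up values) so that it runs on its own; with your own data you would use DATA instead.

```r
# Stand-in data frame with made-up values (one NOSM value is missing)
DATA_demo <- data.frame(
  plot_num = 1:4,
  NOSM     = c(95.2, 110.4, NA, 88.0),
  SM       = c(70.1, 81.3, 69.5, 60.2)
)

# Show the first rows (up to 6) of the data
head(DATA_demo)

# Show the type of each column (both groups should be numeric)
str(DATA_demo)

# Count how many plots have a value for both groups
n_complete <- sum(complete.cases(DATA_demo[, c("NOSM", "SM")]))
n_complete
```

Only plots with a value in both columns can contribute a difference to a paired test, so the count of complete pairs is worth knowing before you go on.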

Exploring the Data

Before we run the test, it is a good idea to explore what the data look like. This gives you a chance to see if there are any outliers and to determine if your prediction based on your hypothesis seems supported by the data (presumably you would have some information about step moss and soil moisture before measuring the moisture content of your soils and running this analysis on what you expect to find). We will examine how soil moisture content varies under step moss and not under step moss. For this test, the continuous response variable is the soil moisture content and the explanatory categorical variable is whether soil was under step moss or not.

You will use a McNeil plot and a profile plot for your data. These give you an overall view of whether or not your response variable is higher, lower, or the same between your two groups.

To be able to run the code that creates these plots, you need to install and load the package PairedData. You can do this by running the code below. The first line of code (install.packages (“PackageName”)) will install the package. The second line of code (library(“PackageName”)) loads it into R.

#code to install package "PairedData"
install.packages("PairedData")

#code to load package "PairedData"
library("PairedData")

## Loading required package: MASS

## Loading required package: gld

## Loading required package: mvtnorm

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Attaching package: 'PairedData'

## The following object is masked from 'package:base':
## 
##     summary

You are going to create two types of plots to examine your data. The first one is called a McNeil plot. Here each “subject” is a sample showing two dots: one for each of your groups. The paired t-test examines the difference between your two groups for each subject, so if one of your groups is consistently higher than the other (one of the colored dots is consistently to one side of the other), regardless of how much the subjects differ from each other overall, the effect of your explanatory variable is likely to be significant (this is similar to a paired index plot). The second plot is a profile plot. It is a boxplot that shows a line between each of your paired groups. If most of the lines go one way, then it is likely that one group is consistently higher than the other and your response variable differs significantly between your explanatory variable groups.

#code to create the McNeil plot and the profile plot
attach(DATA)
pd1<-with(DATA,paired(NOSM,SM))
plot(pd1,type="McNeil")

plot(pd1,type="profile")

These data show that while there is a lot of variation in percent soil moisture in both groups, the soil moisture content not under step moss (NOSM) is consistently higher than the soil moisture content under step moss (SM). In the McNeil plot, the blue dots (SM) are consistently lower (to the left) than the pink dots (NOSM). In the profile plot, most of the lines go down from NOSM to SM. There are also no outliers.
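If you want numbers to go with what you see in the plots, you can count how many pairs go in each direction. This is only a sketch with made-up values (NOSM_demo and SM_demo are hypothetical stand-ins for your two columns).

```r
# Made-up soil moisture values for five plots
NOSM_demo <- c(95.2, 110.4, 88.0, 102.7, 76.3)
SM_demo   <- c(70.1, 81.3, 69.5, 105.0, 60.2)

# Difference within each pair
d_demo <- NOSM_demo - SM_demo

# Number of pairs where NOSM is higher, and where SM is higher
n_higher_NOSM <- sum(d_demo > 0)
n_higher_SM   <- sum(d_demo < 0)
n_higher_NOSM
n_higher_SM

# Summary of the differences (useful for spotting extreme values)
summary(d_demo)
```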

Running the Paired T-test

Now you can run the paired t-test. Below is the code you will use to run the paired t-test. Remember that to actually run the test, you need to replace the two placeholder column names (one response variable column for each group) with your own column names. Use the code for names(DATA) above to make sure you write in the variables exactly as they are in the data file. If you write in species instead of Species, for example, the test will not work because R won’t be able to find species in your datafile.

#code to run the paired t-test to test model assumptions (replace the two placeholder column names with your own)
ttest<-t.test(DATA$Response.variable.for.group.1, DATA$Response.variable.for.group.2, paired = TRUE)

#code to run the paired t-test for the step moss data to test model assumptions
ttest<-t.test(DATA$NOSM,DATA$SM,paired=TRUE)

You will notice that not much happened. If you didn’t get any error messages, that means that the test was successful. R runs the code and stores the information. You have to do more work to get at the results of the test. But first we need to make sure we can even look at the results.

You never look at the results of the test until AFTER you check that the assumptions of the model were met. So next we will examine if our data meet the assumption of the residuals following a normal distribution.

Evaluating Model Assumptions

The difference between the observed and predicted values, called the residuals, is a measure of the error associated with each observation or the variation that is not explained by the explanatory variable. We plot the residuals in various ways to examine normality, which is the only assumption we need to check for a paired t-test.

To test whether the residuals are normally distributed (model assumption 1 above), you will use three pieces of information:

1. a Shapiro-Wilk test of normality
2. a density plot of residuals
3. a Q-Q plot of residuals.

The residuals in a paired t-test are calculated from the difference between the values of each pair (NOSM and SM), so you need to first calculate this difference (d in the code below) and then test the normality of that difference. Use the code below to run a Shapiro-Wilk test of normality.

#code to run a Shapiro-wilk test of normality on paired data
d<-DATA$NOSM - DATA$SM
shapiro.test(d)

## 
##  Shapiro-Wilk normality test
## 
## data:  d
## W = 0.98519, p-value = 0.9827

This test shows a p-value > 0.05, which means that the distribution of the differences is not significantly different from a normal distribution. To create a density plot of residuals, use the code below.

#code to create a density plot of residuals
plot(density(na.omit(d)))

Examine the density plot. It should follow more or less a bell-shaped curve. It can be hard to determine whether or not your density plot is “normal” as it takes practice. If there are any weird bumps or the curve is clearly skewed to one side or bimodal, chances are your residuals are not normal. This density plot looks normal. See the ANOVA webpage for examples of non-normal and normal curves.

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential. Q-Q plots take your sample data, sort them in ascending order, and then plot them versus quantiles calculated from a theoretical distribution, in this case the normal distribution. If the points fall pretty closely along the line, the data are consistent with a normal distribution.

The functions that create a Q-Q plot are in the package stats. R loads this package automatically when it starts, so the library(stats) line in the code below is optional, but it does no harm to run it.

#code to load the package "stats" into R
library(stats)
#code to create the Q-Q plot
qqnorm(d)
qqline(d, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

This plot looks good as the points fall close to the line. Together with the Shapiro-Wilk test and the density plot, this indicates that the assumption of normality has been met.
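If you run these three checks often, you could bundle them into one function. The sketch below is one possible way to do that; the function name check_normality and the simulated differences in d_demo are made up for illustration and are not part of the step moss dataset.

```r
# Hypothetical helper: runs the Shapiro-Wilk test and draws the
# density plot and Q-Q plot for a vector of paired differences
check_normality <- function(d) {
  d <- d[!is.na(d)]                  # drop missing differences
  sw <- shapiro.test(d)              # Shapiro-Wilk test of normality
  old_par <- par(mfrow = c(1, 2))    # two plots side by side
  on.exit(par(old_par))              # restore plot settings afterwards
  plot(density(d), main = "Density of differences")
  qqnorm(d)
  qqline(d)
  invisible(sw)                      # return the test result quietly
}

# Simulated differences, used only to show the function running
set.seed(1)
d_demo <- rnorm(20, mean = 25, sd = 10)
sw <- check_normality(d_demo)
sw$p.value
```

With your own data you would call check_normality(DATA$NOSM - DATA$SM) instead of using the simulated values.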

If your model output DOES NOT meet the assumption of normality, see My Data Are Not Normal - What Do I Do? below.

If your model output DOES meet the assumption of normality, proceed to Interpreting Paired T-test Output below.

Interpreting Paired T-Test Output

If your assumptions have been met, use the code below to run the t-test and display the results.

ttest<-with(DATA, t.test(NOSM, SM, paired=TRUE))
ttest

## 
##  Paired t-test
## 
## data:  NOSM and SM
## t = 4.4099, df = 19, p-value = 0.0003009
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  13.48773 37.85706
## sample estimates:
## mean of the differences 
##                 25.6724

According to the paired t-test, the soil moisture content under step moss is significantly lower than not under step moss. You should copy and paste the results into a Word doc or Excel file so you record it somewhere. For your results section, you should note the t-statistic, df and the p-value.
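Instead of copying the numbers by hand, you can also pull them out of the stored test object. The sketch below uses made-up values (NOSM_demo and SM_demo are hypothetical) and also recomputes the t-statistic by hand to show where it comes from: the mean of the differences divided by the standard error of the differences.

```r
# Made-up soil moisture values for five plots
NOSM_demo <- c(95.2, 110.4, 88.0, 102.7, 76.3)
SM_demo   <- c(70.1, 81.3, 69.5, 105.0, 60.2)

# Run the paired t-test and store the result
res <- t.test(NOSM_demo, SM_demo, paired = TRUE)

# Pieces of the output you need for your results statement
res$statistic   # t-statistic
res$parameter   # degrees of freedom
res$p.value     # p-value
res$conf.int    # 95 percent confidence interval of the mean difference
res$estimate    # mean of the differences

# The same t-statistic calculated by hand from the differences
d <- NOSM_demo - SM_demo
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))
t_manual
```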

Presenting your Results

Results Statements

The results section of your paper should begin with a narrative of your results statements. These statements should be quantitative in nature and include 1. the statistical significance and 2. the biological significance.

Statistical Significance: Your first sentence should list the results of the statistical test (in this case whether soil moisture content varied significantly under step moss or not). You include statistical data in parentheses at the end of the sentence only. You should never write “The p-value was…” or “The t-stat was, which means…”. You don’t write about your statistics, you just write the biological results and include your statistics in parentheses. This sentence tells the reader whether your results were statistically significant.
Biological Significance: For the quantitative statements, you should include means and standard errors of each group and their effect size. To get the means and standard errors for each group, you can use the code below. Recall that standard error is the standard deviation (sd in the code below) divided by the square root of the number of samples (sqrt(length(DATA$group[!is.na(DATA$group)])) in the code below).

#mean soil moisture for group = NOSM
mean(DATA$NOSM, na.rm=TRUE)

## [1] 99.03597

#standard error of the mean soil moisture for group = NOSM
sd(DATA$NOSM, na.rm=TRUE) /  sqrt(length(DATA$NOSM[!is.na(DATA$NOSM)]))

## [1] 7.844089

#mean soil moisture for group = SM
mean(DATA$SM, na.rm=TRUE)

## [1] 73.36357

#standard error of the mean soil moisture for group = SM
sd(DATA$SM, na.rm=TRUE) /  sqrt(length(DATA$SM[!is.na(DATA$SM)]))

## [1] 5.245563
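To avoid retyping the standard error formula for each group, you could wrap it in a small function, and you can calculate the effect size as a percent difference between the two means. The function names se and percent_lower below are made up for this sketch; the two means passed to percent_lower are the ones printed above.

```r
# Hypothetical helper: standard error of the mean, ignoring NAs
se <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Hypothetical helper: how much lower other_mean is than
# reference_mean, as a percent of reference_mean
percent_lower <- function(reference_mean, other_mean) {
  100 * (reference_mean - other_mean) / reference_mean
}

# Small made-up vector to show se() running
demo <- c(10, 12, NA, 14)
se(demo)

# Percent difference using the NOSM and SM means from the output above
pct <- percent_lower(99.03597, 73.36357)
pct
```

With the two means above, soil moisture under step moss comes out about 26% lower than not under step moss.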

You could also calculate the mean, standard deviation and standard error in Excel. Use the link below to watch a video: https://www.screencast.com/t/wFu7UJZbU3cg.

The first results statement that includes the statistical results in parentheses should also reference the figure you are referring to. Remember to always refer to figures in the order in which they appear, and to include your results narrative before the figure.

Results Figure

You could present your data either with the McNeil plot or the profile plot but in ecology we want you to use the profile plot. The code to create the profile plot is below.

#code to create the profile plot for paired data (with(DATA, ...) finds the columns in DATA, so attach(DATA) is not needed again)
pd1<-with(DATA,paired(NOSM,SM))
plot(pd1,type="profile")

Figure legend

A caption must be included below each figure.

The caption for a paired t-test must include:

1.  A short descriptive title following the figure number  
2.  A description of what you plotted (including the type of error bars, if appropriate)  
3.  Your sample size (e.g., # transects/site)  
4.  The p-value.

Example Results Statement and Figure

Percent soil moisture content was significantly lower in soils collected under step moss than not under step moss (paired t-test, t = 4.41, df = 19, p < 0.001). Average (± SE) soil moisture under step moss (73.36% ± 5.25) was 26% lower than not under step moss (99.04% ± 7.84, Figure 3).

Figure 3. Percent soil moisture is significantly lower in soils covered by step moss (Hylocomium splendens, SM) than in soils not covered by step moss (NOSM) in the Hoh rainforest, WA (p < 0.001). Points represent the percent soil moisture under step moss and not under step moss in a paired design (n = 20 pairs).

My Data Are Not Normal - What Do I Do?

When you can’t meet the assumptions of the model, you can either:
1. Transform the data
2. Run a non-parametric test

Transforming your data means modifying the response variable so that the assumptions are met. This often means taking the square root or the log of the response variable and running the test again. The reason transformations work is that the order of your replicates stays the same (sample 1 is still lower than sample 2, for example) but the spacing between them changes: large values are pulled in more than small values, which reduces skew. If your density plot is bimodal (two large humps in your curve), a transformation will sometimes make those humps less pronounced. You still present your data using the raw numbers, not the transformed numbers.

To transform your data for the example above, you have to transform both sets of your response variable. Use the code below to do either a log transformation (first two lines of code) or a square root transformation (last two lines of code).

#log transformation of both response variables
logNOSM<-log(DATA$NOSM)
logSM<-log(DATA$SM)

#square root transformation of both response variables
sqrtNOSM<-sqrt(DATA$NOSM)
sqrtSM<-sqrt(DATA$SM)

Then you can run the analyses again using your new variables (logSM and logNOSM or sqrtSM and sqrtNOSM) in place of your original variables (SM and NOSM). If you transformed your response variables, return to Running the Paired T-test above and use your new variables.
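As a rough sketch of what re-running the analysis looks like after a log transformation, the code below repeats the normality check and the test on transformed values. The values in NOSM_demo and SM_demo are made up. Keep in mind that log() only works for values greater than zero, and sqrt() only for values of zero or more.

```r
# Made-up soil moisture values for six plots (all greater than zero)
NOSM_demo <- c(95.2, 110.4, 88.0, 102.7, 76.3, 140.9)
SM_demo   <- c(70.1, 81.3, 69.5, 105.0, 60.2, 92.4)

# Log-transform both sets of the response variable
logNOSM_demo <- log(NOSM_demo)
logSM_demo   <- log(SM_demo)

# Differences between the transformed values of each pair
d_log <- logNOSM_demo - logSM_demo

# Check normality of the transformed differences
shapiro.test(d_log)

# Run the paired t-test on the transformed values
ttest_log <- t.test(logNOSM_demo, logSM_demo, paired = TRUE)
ttest_log
```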

If you can’t meet the assumptions even after transforming your data, you can use a non-parametric test. A non-parametric test does not assume that your data follow a particular underlying distribution (such as the normal distribution) and, therefore, does not need to meet the assumption of normality as the parametric t-test does.

The non-parametric alternative to the paired t-test is called a Wilcoxon signed-rank test. To run the Wilcoxon signed-rank test use the code below.

#code to run a Wilcoxon signed-rank test for paired data
wilcox.test(DATA$NOSM, DATA$SM, paired = TRUE)

## Warning in wilcox.test.default(DATA$NOSM, DATA$SM, paired = TRUE): cannot
## compute exact p-value with ties

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  DATA$NOSM and DATA$SM
## V = 194, p-value = 0.0009524
## alternative hypothesis: true location shift is not equal to 0

In your results statement, you need to include the V value and the p-value in parentheses after a statement of significance. For more information on why this is a non-parametric test and what the output means, see this website: https://data.library.virginia.edu/the-wilcoxon-rank-sum-test/.
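As with the t-test, you can store the result and pull out the values you need. The sketch below uses made-up values (NOSM_demo and SM_demo are hypothetical). V is the sum of the ranks of the positive differences, so the code also calculates it by hand for comparison.

```r
# Made-up soil moisture values for five plots
NOSM_demo <- c(95.2, 110.4, 88.0, 102.7, 76.3)
SM_demo   <- c(70.1, 81.3, 69.5, 105.0, 60.2)

# Run the Wilcoxon signed-rank test and store the result
wres <- wilcox.test(NOSM_demo, SM_demo, paired = TRUE)

# Values for your results statement
wres$statistic   # the V value
wres$p.value     # the p-value

# V by hand: rank the absolute differences, then add up the
# ranks that belong to positive differences
d <- NOSM_demo - SM_demo
V_manual <- sum(rank(abs(d))[d > 0])
V_manual
```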

Quick Paired T-test

#bring data into RStudio
DATA <- data.frame(read.table(file='SoilMoisture_paired.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))

#check that data was loaded properly
names(DATA)

#explore data (look for outliers)
install.packages("PairedData")
library("PairedData")
attach(DATA)
pd1<-with(DATA,paired(NOSM,SM))
plot(pd1,type="McNeil")
plot(pd1,type="profile")

#run the model
ttest<-with(DATA, t.test(NOSM, SM, paired=TRUE))

#check model assumptions
d<-DATA$NOSM - DATA$SM
shapiro.test(d)
plot(density(na.omit(d)))
qqnorm(d)
qqline(d, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

#see model summary
ttest

#results figure is the same as above in #explore data

*Written July 2018. Modified from http://stats.pugetsound.edu/ecology/