t-test

A t-test is a parametric test that compares two means and tells you if they are significantly different from each other. A t-test has two sampling variability assumptions that have to be met:

  1.    The residuals have to be normally distributed for each group;
  2.    There is equal variance among groups (homogeneity of variance).

The residuals are the leftover variation that is not accounted for by your explanatory variable. You can use analyses of the residuals to check both of these assumptions. Before we can interpret our statistical output, we need to make sure the assumptions above are met.

Background Theory

A t-test is based on the t statistic, which follows the Student’s t distribution and measures the difference between two groups relative to the variation within the groups. This is essentially measuring how different two means are relative to their variance. If the difference between your two groups is larger than the variation within groups (differences among means are large and variance within each group is small), then the groups will be significantly different from each other.

Here are some statistical terms you should understand to interpret your t-test.

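The t statistic is the difference between the two group means divided by the standard error of that difference (which is estimated from the variation within the groups).
The p-value is the probability of getting a t statistic at least as extreme as the one you observed if there were truly no difference between the groups; it is the area under the curve of the t distribution beyond your t statistic (Figure 1). Alpha is the significance threshold that the p-value is compared against. It is the chance that you are wrong if you accept that there is a significant effect when there really isn’t one (also known as the Type I Error rate), and scientists have conventionally chosen it to be 5%.
The t_critical value is the value of t beyond which the area under the curve equals alpha (5%).
The df (degrees of freedom) is equal to the number of observations minus the number of parameters estimated (for a two-sample t-test, the total number of samples - 2, because a mean is estimated for each of the two groups).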
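To make these terms more concrete, the t statistic can be computed by hand in R. This is only a minimal sketch: it assumes equal variances (a pooled t-test) and uses the rounded group means, standard deviations and sample sizes reported for the lichen data further down this page. The variable names are made up for illustration.

```r
# group summaries for the lichen data (rounded values reported later on this page)
mean_high <- 17.15385; sd_high <- 4.71; n_high <- 13
mean_low  <- 54.58824; sd_low  <- 7.56; n_low  <- 17

# degrees of freedom: total observations minus the two estimated group means
df <- n_high + n_low - 2

# pooled variance: within-group variation, weighted by each group's df
pooled_var <- ((n_high - 1) * sd_high^2 + (n_low - 1) * sd_low^2) / df

# standard error of the difference between the two means
se_diff <- sqrt(pooled_var * (1 / n_high + 1 / n_low))

# t statistic: difference in means divided by its standard error
t_stat <- (mean_high - mean_low) / se_diff
t_stat
```

Because the standard deviations are rounded, the result should be close to, but not exactly, the t value that t.test() reports for these data.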

Most of the time, you are only testing if the values are different from each other, regardless of the direction, so a two-tailed t-test is used, in which case the area under each tail of the curve is 2.5% (0.025; Figure 1). If you are interested in whether one mean is only larger or only smaller than another, then you use a one-tailed t-test. If the absolute value of the t statistic generated by your test is larger than the t_critical value, the area under the curve beyond it is smaller than 5% and, therefore, the p-value is < 0.05, and the means are significantly different. The difference between your groups is larger than you would expect from random chance alone. And if you only say that there is a significant effect when p < 0.05, you will be wrong no more than 5% of the time when there really isn’t an effect.
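If you want to see where the critical value and the p-value come from, the base R functions qt() and pt() give quantiles and tail areas of the t distribution. The sketch below assumes a two-tailed test with alpha = 0.05 and plugs in the t statistic and df from the lichen t-test output shown later on this page.

```r
# t statistic and degrees of freedom taken from the lichen t-test output on this page
t_stat <- -15.652
df <- 28
alpha <- 0.05

# two-tailed critical value: alpha is split evenly between the two tails
t_critical <- qt(1 - alpha / 2, df)
t_critical

# two-tailed p-value: area in both tails beyond |t_stat|
p_value <- 2 * pt(-abs(t_stat), df)
p_value

# the difference is significant when |t| exceeds the critical value
abs(t_stat) > t_critical
```

For a one-tailed test, you would use qt(1 - alpha, df) and a single tail area instead.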

Figure 1. t distribution. Image: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-the-differences-between-one-tailed-and-two-tailed-tests/

In the figure below, the t statistic can be conceptualized as the ratio of the difference in means (blue arrow) over the variance you can’t explain (red braces). If you divide the blue arrows in each figure by the red braces, the t statistic is much larger in the left figure than the right figure. That means you are explaining more variation than you aren’t explaining in the left figure (large difference between groups and small variation within groups) and the data in the left figure are statistically significant. The data in the right figure are not statistically significant because there is a small difference between groups (small blue arrow) and large variation within groups (large red braces).

Figure 2. Average (± SE) percent coverage of lichen under high light (<75% canopy cover) and low light (>75% canopy cover) on sycamore trees in Tacoma, WA. Data are used for illustrative purposes.

Information about the Dataset

For this test, we will be using a dataset on lichen cover on sycamore trees (Platanus occidentalis) in Tacoma, WA. The percentage of foliose lichen cover on sycamore trees was measured using the dot-intercept method where each lichen found under 100 randomly placed dots was recorded. Foliose lichen were recorded and totalled for each of 30 trees. We will be examining how the percentage cover of foliose lichen varies with light availability. Light was measured as percent canopy cover using a hand-held densiometer.

Below is a picture of some of the foliose lichen measured in this study.

Figure 3. Foliose lichen (light grey lichen) on a Sycamore tree in Tacoma, WA.

Loading the Data into RStudio

In order to make your RStudio script file organized, you will want to include some information at the top of the file. You can use the hashtag (#) to add comments, which are notes in your RStudio file that R will ignore.

For each test, you should include the following lines at the top of each RStudio file:

#question:
#response variable:
#explanatory variable:
#test name:

Below is what it would look like for the lichen example:

#question: how does light availability influence lichen percent cover?
#response variable: lichen percent cover (continuous)
#explanatory variable: light availability (categorical)
#test name: t-test

In order to run a t-test, you need to have your response variable in one column and your explanatory variable in another column so that there are replicates for each of your groups.
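If your data were recorded with one column per group instead, one possible way to rearrange them in R is with stack(). This is just a sketch: the High values are the first three shown in the lichen data on this page, the Low values are made up, and the object names are hypothetical.

```r
# measurements recorded side by side, one vector per group
# (Low values are made up for illustration)
wide <- list(High = c(10, 24, 12), Low = c(50, 61, 47))

# stack() puts all values in one column and the group labels in another
long <- stack(wide)
names(long) <- c("Foliose", "Light")
long
```

Each row is now one replicate, with the response in the first column and its group in the second.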

To bring data into RStudio, you can use the line of code below that will allow you to select any file directly from your computer.

Make sure your data file is saved as a .csv (comma separated file). (Mac users have to save their Excel files as “MS-DOS Comma Separated (.csv)” file using the File Format menu under the “save as” command.)

#code to load a datafile into R from your computer
DATA <- read.csv(file.choose(), na.strings=".")

An alternative way to open your file is to set your working directory to the folder on your computer where your file is saved and then load the file by name. This will allow you to choose a named file that you saved in your working directory (to learn more about what your working directory is or how to save files, go to the main webpage and click on Introduction to RStudio). In the DATA code line, TO.BE.EDITED is the name of your file.

#code to set the working directory folder on your computer
setwd('C:/Users/YourUserName/Documents/Rfiles')
#code to load a datafile into R from your working directory
DATA <- read.csv(file="TO.BE.EDITED.csv", header=TRUE, sep=",", na.strings=".")

You can download the lichen.csv file, save it in your working directory and then you can use the code below to upload the lichen.csv file into RStudio.

#code to load the datafile "lichen.csv" into R from your working directory
DATA <- data.frame(read.table(file='lichen.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))

To make sure it loaded correctly, you can copy and paste the following code to see the names of all of the columns in your file.

#code to see the names of the column titles
names(DATA)

## [1] "Tree"    "GrFo"    "SGFo"    "LGFo"    "Foliose" "Light"

If the file was successfully loaded into RStudio, you should have the names of the columns above. The ‘Foliose’ column lists the total cover of the three lichen morphotypes (GrFo, SGFo, and LGFo). Light is the categorical measure of light as either high or low. If you got an error, go back to the main stats webpage and click on Troubleshooting.

You can also use the following lines of code to make sure your data was read in correctly. The first line of code results in the top 6 rows of your data being displayed instead of just the column names, and the second line of code gives you the dimensions of your datafile.

#code to see the top 6 rows of your data file
head(DATA)

##   Tree GrFo SGFo LGFo Foliose Light
## 1    1    0    0    0      10  High
## 2    2   50    0    0      24  High
## 3    3    0    0    0      12  High
## 4    4   14    0    0      14  High
## 5    5    0   33    0      22  High
## 6    6    0   20    4      16  High

#code to see the dimensions (number of rows and columns) of your datafile
dim(DATA)

## [1] 30  6

Once you are sure the data are entered into R, you can proceed.

Exploring the Data

Before we run the test, it is a good idea to explore what the data look like. This gives you a chance to see if there are any outliers and to determine if your prediction based on your hypothesis seems supported by the data (presumably you would have some information about lichen before measuring their coverage on trees and running this analysis on what you predict to find). We will examine how lichen coverage varies with light. For this test, the percentage coverage of foliose lichen (Foliose) is the continuous response variable and light (Light) is the categorical explanatory variable.

Use the following code to create a box plot of your data. This will show you any outliers as it shows the maximum and minimum values within each group. See this page for more information on what a box plot is: http://www.physics.csbsju.edu/stats/box2.html

#code to create a boxplot of lichen data
boxplot(Foliose~Light, data=DATA, xlab = "Light level", ylab = "Percent cover of lichen")

Looking at these data it seems that there might be differences in percent coverage of lichen with light. It also shows no outliers (they would show up as dots beyond the whiskers).
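If you would rather list outliers than look for dots, base R's boxplot.stats() returns the points that a box plot would draw separately. In the sketch below, the first six numbers are the High light values shown above and the last value (60) is made up so that there is something to flag.

```r
# first six High-light values shown above, plus one made-up extreme value (60)
x <- c(10, 24, 12, 14, 22, 16, 60)

# $out lists the points that boxplot() would draw as separate dots
outliers <- boxplot.stats(x)$out
outliers
```

An empty result means no points fall far enough from the box to be drawn as outliers.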

Running the T-test

Now you can run the t-test. Below is the code you will use to run the model. We need to run a linear model first (lm) in order to get the residuals and assess the model’s assumptions. Remember that to actually run the test, you need to change the Response variable and Explanatory variable to your own variable names. Use the code for names(DATA) above to make sure you write in the variables exactly as they are in the data file. If you write in foliose instead of Foliose, for example, the test will not work because R won’t be able to find foliose in your datafile.

#code to run the t-test to test model assumptions
fit<-lm(Response variable~Explanatory variable, data=DATA)

#code to run the t-test for the lichen data to test model assumptions
fit<-lm(Foliose~Light, data=DATA)

You will notice that not much happened. If you didn’t get any error messages, that means that the test was successful. R runs the code and stores the information. You have to do more work to get at the results of the test. But first, we need to make sure we can even look at the results.

You never look at the results of the test until AFTER you check that the assumptions of the model were met. So next we will examine if our data meet the assumptions of 1) residuals following a normal distribution and 2) equal variance among groups (the residuals from each group should be relatively similar).

Evaluating Model Assumptions

The difference between the observed and predicted values, called the residuals, is a measure of the error associated with each observation or the variation that is not explained by the explanatory variable. We plot the residuals in various ways to examine normality and homogeneity of variances.
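With a single categorical explanatory variable, the predicted value for each observation is simply its group mean, so you can reproduce the residuals yourself. The sketch below uses a simulated stand-in for the lichen data (the numbers are not the real measurements) to show that the two calculations agree.

```r
# simulated stand-in for lichen.csv (values are NOT the real data)
set.seed(1)
DATA <- data.frame(
  Light   = rep(c("High", "Low"), times = c(13, 17)),
  Foliose = c(rnorm(13, mean = 17, sd = 5), rnorm(17, mean = 55, sd = 8))
)
fit <- lm(Foliose ~ Light, data = DATA)

# with one categorical predictor, the predicted value is the group mean
group_mean <- ave(DATA$Foliose, DATA$Light)

# residual = observed value - predicted value
manual_resid <- DATA$Foliose - group_mean

# TRUE if the hand-calculated residuals match the ones stored in the model
all.equal(unname(residuals(fit)), manual_resid)
```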

To test model assumption 1 (whether the residuals are normally distributed), you will use 2 pieces of information:

  1. a Shapiro-Wilk test of normality for each group
  2. a density plot of residuals for each group

To create a density plot of residuals, use the code below.

#code to create the density plot of residuals
lattice::densityplot(~residuals(fit), groups=Light, data=DATA, auto.key=TRUE)

Examine the density plot. Each line should follow more or less a bell-shaped curve. It is hard to determine whether or not your density plot is “normal” as it takes practice. If there are any large bumps or the curve is clearly skewed to one side or bimodal, chances are your residuals are not normal. Check out the examples of normal and non-normal density plots under the ANOVA page. If your density plot of residuals doesn’t look normal, you need to either transform your response variable or use a non-parametric test. See My data are not normal, what do I do? below for help.

To further determine if your density plots are normal, use the code below to run a Shapiro-Wilk normality test on each curve. Remember to change the explanatory variable and the groups of that variable when running your own test.

#code to run a Shapiro-wilk test of normality for the "High" treatment
with(DATA, shapiro.test(Foliose[Light == "High"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  Foliose[Light == "High"]
## W = 0.92829, p-value = 0.3237

#code to run a Shapiro-wilk test of normality for the "Low" treatment
with(DATA, shapiro.test(Foliose[Light == "Low"]))

## 
##  Shapiro-Wilk normality test
## 
## data:  Foliose[Light == "Low"]
## W = 0.94112, p-value = 0.3317

Both of these tests show a p-value > 0.05, which means that the distributions are not significantly different from a normal distribution. The first assumption of normality has been met. Note that this means there is no evidence against normality, not proof that the data are exactly normal.
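The Shapiro-Wilk code above is run on the raw response values within each group rather than on the residuals. Within a group, the residuals are just the raw values minus the group mean, and the test is not affected by that shift, so both approaches should give the same answer. The sketch below checks this on a simulated stand-in for the lichen data (not the real measurements).

```r
# simulated stand-in for lichen.csv (values are NOT the real data)
set.seed(1)
DATA <- data.frame(
  Light   = rep(c("High", "Low"), times = c(13, 17)),
  Foliose = c(rnorm(13, mean = 17, sd = 5), rnorm(17, mean = 55, sd = 8))
)
fit <- lm(Foliose ~ Light, data = DATA)

# Shapiro-Wilk p-value for the residuals of each group
p_resid <- tapply(residuals(fit), DATA$Light, function(r) shapiro.test(r)$p.value)

# Shapiro-Wilk p-value for the raw response values of each group
p_raw <- tapply(DATA$Foliose, DATA$Light, function(v) shapiro.test(v)$p.value)

# the two sets of p-values should agree
rbind(p_resid, p_raw)
```

Using tapply() this way also saves you from writing one line per group when your explanatory variable has differently named groups.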

To test model assumption 2 (whether the residuals of each group have equal variance), you will use:

1. a density plot of residuals for each group (same figure as 2. above)

Looking at the density plot for each group, you want to notice if the spread of the curves is about the same. If one curve is very narrow and the other curve is very wide, then the assumption of equal variance is not met. However, if the curves approximate the same normal distribution, then the assumption of equal variance is met. The case above shows that the spread of each group is pretty similar so the assumption of equal variance has been met.
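If you find the spread hard to judge by eye, you could also compare the group variances numerically. This goes beyond the visual check described above and is only offered as an optional sketch: it uses a simulated stand-in for the lichen data and base R's var.test(), which is an F test for equal variances that itself assumes normality.

```r
# simulated stand-in for lichen.csv (values are NOT the real data)
set.seed(1)
DATA <- data.frame(
  Light   = rep(c("High", "Low"), times = c(13, 17)),
  Foliose = c(rnorm(13, mean = 17, sd = 5), rnorm(17, mean = 55, sd = 8))
)

# variance of the response within each group
group_var <- tapply(DATA$Foliose, DATA$Light, var)
group_var

# ratio of the larger to the smaller variance (close to 1 = similar spread)
max(group_var) / min(group_var)

# F test comparing the two variances
vt <- var.test(Foliose ~ Light, data = DATA)
vt$p.value
```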

Interpreting T-Test Output

If your assumptions have been met, use the code below to run the t-test. Because the second assumption of equal variance was met in this example, the line of code includes var.equal=TRUE. If the second assumption was not met, you could still run a t-test but it would be a variation of the t-test (called a Welch’s t-test) that does not assume equal variance. In that case, you would change the code below to var.equal=FALSE. Remember that if you needed to transform your response variable, you should use the transformed variable here.

#code to run the t-test
t.test(Foliose~Light, data=DATA, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  Foliose by Light
## t = -15.652, df = 28, p-value = 2.241e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -42.33339 -32.53539
## sample estimates:
## mean in group High  mean in group Low 
##           17.15385           54.58824

According to the t-test output, there is a significant difference in foliose lichen percent coverage on trees in high and low light. You should copy and paste the summary into a Word doc or Excel file so you record it somewhere. For the results section of your paper, you should note the t-statistic, df and the p-value.
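Instead of copying numbers from the printed output, you can store the result of t.test() and pull out the pieces you need. The sketch below uses a simulated stand-in for the lichen data (not the real measurements); it also shows that the equal-variance t-test and the linear model fitted earlier give the same t statistic, apart from the sign, which depends on the order of the groups.

```r
# simulated stand-in for lichen.csv (values are NOT the real data)
set.seed(1)
DATA <- data.frame(
  Light   = rep(c("High", "Low"), times = c(13, 17)),
  Foliose = c(rnorm(13, mean = 17, sd = 5), rnorm(17, mean = 55, sd = 8))
)

# store the t-test result instead of only printing it
result <- t.test(Foliose ~ Light, data = DATA, var.equal = TRUE)
result$statistic   # t statistic
result$parameter   # degrees of freedom
result$p.value     # p-value

# the same comparison from the linear model (Low minus High, so the sign flips)
fit <- lm(Foliose ~ Light, data = DATA)
t_lm <- coef(summary(fit))["LightLow", "t value"]
all.equal(unname(result$statistic), -t_lm)
```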

Presenting your Results

Results Statement

The results section of your paper should begin with a narrative of your results statements. These statements should be quantitative in nature and include 1. the statistical significance and 2. the biological significance.

Statistical Significance: Your first sentence should list the results of the statistical test (in this case whether foliose lichen were significantly influenced by light level). You include statistical data in parentheses at the end of the sentence only. You should never write “The p-value was…” or “The t-stat was, which means…”. You don’t write about your statistics, you just write the biological results and include your statistics in parentheses. This satisfies whether your results were statistically significant.
Biological Significance: For the quantitative statements, you should include means and standard errors of each group and their effect size. The mean values for each group are provided in your t-test output. You can get more summary data using the code below.

To help you get means and standard errors, you can install the psych package, and then load it using the following code:

#code to install package "psych"
install.packages("psych")

#code to load package "psych" into R
library(psych)

## Warning: package 'psych' was built under R version 3.5.3

The following code shows you how you can get your response data summarized by your explanatory variable.

#code to get summary statistics of your response variable for each group of your explanatory variable
describeBy(DATA$response variable, group=DATA$explanatory variable)

For the lichen example, use the following code to get a summary of the data including means and standard errors.

#code to get summary statistics of Foliose percent cover by light availability levels
describeBy(DATA$Foliose, group=DATA$Light)

## 
##  Descriptive statistics by group 
## group: High
##    vars  n  mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 13 17.15 4.71     16   17.18 7.41  10  24    14 -0.04     -1.6
##      se
## X1 1.31
## -------------------------------------------------------- 
## group: Low
##    vars  n  mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 17 54.59 7.56     56   54.73 7.41  41  66    25 -0.31    -1.24
##      se
## X1 1.83

In this output, n = the number of replicates (i.e., sample size) for that group; mean = average; sd = standard deviation; se = standard error. You can use these data to create your results statements and figure (or you can calculate them in Excel or in R). To learn how to calculate the mean, standard deviation and standard error in Excel, watch this video: https://www.screencast.com/t/wFu7UJZbU3cg
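The standard error in this output is the standard deviation divided by the square root of the sample size. As a quick check, the sketch below recomputes it in R from the sd and n values in the describeBy output above (the variable names are made up for illustration).

```r
# sd and n for each group, taken from the describeBy output above
sd_high <- 4.71; n_high <- 13
sd_low  <- 7.56; n_low  <- 17

# standard error = standard deviation / square root of sample size
se_high <- sd_high / sqrt(n_high)
se_low  <- sd_low / sqrt(n_low)

round(c(High = se_high, Low = se_low), 2)
```

The rounded values should match the se column reported by describeBy.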

The first results statement that includes the statistical results in parentheses should also reference the figure you are referring to. Remember to always (1) refer to figures in the order in which they appear and (2) include your results narrative before the figure.

Results Figure

A bar chart with means and standard error bars or a box plot are commonly used to report the results of a t-test. Standard convention varies by field but in ecology, we use a bar chart. A bar chart (or box plot) is used when your response variable is continuous and your explanatory variable is categorical; each group in your explanatory variable is represented by each bar. You can use Excel to create a bar chart.
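If you would rather stay in R than switch to Excel, a bar chart with standard error bars can be sketched with base graphics. The means and standard errors below are the values from the describeBy output above; the axis limits and labels are arbitrary choices.

```r
# group means and standard errors from the describeBy output above
means <- c(High = 17.15, Low = 54.59)
ses   <- c(High = 1.31,  Low = 1.83)

# barplot() draws the bars and returns the x position of each bar's centre
mids <- barplot(means, ylim = c(0, 70),
                xlab = "Light level",
                ylab = "Percent cover of foliose lichen")

# arrows() with flat heads at both ends (angle = 90, code = 3) makes error bars
arrows(x0 = mids, y0 = means - ses, x1 = mids, y1 = means + ses,
       angle = 90, code = 3, length = 0.1)
```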

If you want to learn how to create a box plot (for any field outside of ecology, for example), see Code to Create a Boxplot below.

Figure legend

A caption must be included below each figure.

The caption for a t-test must include:

1.  A short descriptive title following the figure number  
2.  A description of what you plotted (including the type of error bars, if appropriate)  
3.  Your sample size (e.g., # transects/site)  
4.  The p-value.

Example Results Statement and Figure (Excel was used to generate this figure)

Light availability significantly influenced the percentage cover of trees by foliose lichen (t-test, t = -15.65, df = 28, p < 0.001, Figure 4). Average (± SE) percentage cover of foliose lichen under low light (54.59 ± 1.83) was 218% higher than under high light (17.15 ± 1.31).
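The effect size in this statement is the percent difference between the two group means, expressed relative to the high light mean. A minimal sketch of that arithmetic in R, using the means reported above:

```r
# group means from the results statement above
mean_high <- 17.15
mean_low  <- 54.59

# percent by which the low light mean exceeds the high light mean
pct_diff <- (mean_low - mean_high) / mean_high * 100
round(pct_diff)
```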

Figure 4. Average (± SE) percentage cover of foliose lichen on sycamore trees under high light (<75% canopy cover, n = 13) and low light (>75% canopy cover, n = 17) in Tacoma, WA. Average percent cover of foliose lichen under low light was significantly higher than under high light (p < 0.001).

My Data Are Not Normal - What Do I Do?

When you can’t meet the assumptions of the model, you can either:

  1. Transform the data
  2. Run a non-parametric test

Transforming your data means modifying the response variable so that the assumptions are met. This often means taking the square root or the log of the response variable and running the test again. The reason transformations work is that the rank order of the replicates stays the same (if sample 1 is lower than sample 2, it is still lower after the transformation) but large values are pulled in more than small values, so the absolute distance between replicates is reduced. If your density plot is bimodal (two large humps in your curve), often a transformation will make those humps smaller. You still present your data using the raw numbers, not the transformed numbers.

To transform your data for the example above by taking the log (first line of code below) OR the square root (second line of code below) of your response variable (pick one or the other), you can use the following code:

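#code to take the log of foliose percent cover and store it as a new column in DATA
#(log10 of 0 is -Inf, so this only works if there are no zeros in your response variable)
DATA$logFoliose <- log10(DATA$Foliose)
#code to take the square root of foliose percent cover and store it as a new column in DATA
DATA$sqrtFoliose <- sqrt(DATA$Foliose)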

Then you can run the analyses again using your new variables (logFoliose or sqrtFoliose) in place of your original variable (Foliose) for every line of code including for running the fit model to test assumptions and the t-test code to run the test.
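As a sketch of what re-running the analysis looks like, the code below repeats the main steps with a log-transformed response. It uses a simulated stand-in for the lichen data (not the real measurements) and stores the transformed values as a column of DATA so that data=DATA can find them.

```r
# simulated stand-in for lichen.csv (values are NOT the real data)
set.seed(1)
DATA <- data.frame(
  Light   = rep(c("High", "Low"), times = c(13, 17)),
  Foliose = c(rnorm(13, mean = 17, sd = 5), rnorm(17, mean = 55, sd = 8))
)

# transform the response and keep it as a new column
DATA$logFoliose <- log10(DATA$Foliose)

# re-fit the model and re-check normality with the transformed variable
fit_log <- lm(logFoliose ~ Light, data = DATA)
with(DATA, shapiro.test(logFoliose[Light == "High"]))
with(DATA, shapiro.test(logFoliose[Light == "Low"]))

# re-run the t-test with the transformed variable
t.test(logFoliose ~ Light, data = DATA, var.equal = TRUE)
```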

If you can’t meet the assumptions even after transforming your data, you can use a non-parametric test. A non-parametric test does not assume that the data follow a particular underlying distribution (such as the normal distribution) and, therefore, does not need to meet the assumption of normality as the parametric t-test does.

The non-parametric alternative to the t-test is called a Mann Whitney U test. Use the code below to run a Mann Whitney U test:

#code to run a Mann Whitney U test on foliose percent cover by light level
wilcox.test(DATA$Foliose~DATA$Light)

## Warning in wilcox.test.default(x = c(10L, 24L, 12L, 14L, 22L, 16L, 11L, :
## cannot compute exact p-value with ties

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  DATA$Foliose by DATA$Light
## W = 0, p-value = 4.06e-06
## alternative hypothesis: true location shift is not equal to 0

The test is also referred to as a Wilcoxon rank sum test, which is why the code is wilcox.test. In your results statement, you need to include the W value and the p-value in parentheses after a statement of biological significance. For more information on why this is a non-parametric test and what the output means, see this website: https://data.library.virginia.edu/the-wilcoxon-rank-sum-test/.
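To get a feel for what the W value means, you can count it by hand: when there are no ties, W is the number of pairs in which a value from the first group is larger than a value from the second group. In the sketch below, the High values are the first six shown in the lichen data on this page and the Low values are made up within the reported range of that group.

```r
# first six High-light values shown on this page
high <- c(10, 24, 12, 14, 22, 16)
# made-up Low-light values within the reported range (41 to 66)
low <- c(41, 52, 58, 66)

# count the (high, low) pairs in which the high-light value is larger
W_manual <- sum(outer(high, low, ">"))

# W as reported by wilcox.test()
W_test <- wilcox.test(high, low)$statistic

c(manual = W_manual, wilcox = unname(W_test))
```

A W of 0, as in the lichen output above, means that no value in the first group was larger than any value in the second group.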

Quick T-Test

Here is all of the code you need to quickly run the t-test.

#bring data into RStudio
DATA <- data.frame(read.table(file='lichen.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))

#check that data was loaded properly
names(DATA)

#explore data (look for outliers)
boxplot(Foliose~Light, data=DATA, xlab = "Light level", ylab = "Percent cover of lichen")

#run the model to get residuals
fit<-lm(Foliose~Light, data=DATA)

#check model assumptions
lattice::densityplot(~residuals(fit), groups=Light, data=DATA, auto.key=TRUE)
with(DATA, shapiro.test(Foliose[Light == "High"]))
with(DATA, shapiro.test(Foliose[Light == "Low"]))

#run t-test
t.test(DATA$Foliose~DATA$Light, var.equal=TRUE)

#code to install package "psych"
install.packages("psych")
#code to load package "psych" into R
library(psych)
#code to get summary statistics of Foliose percent cover by light availability levels
describeBy(DATA$Foliose, group=DATA$Light)

Code to Create a Boxplot

Below is the code needed to generate a box-plot but remember that ecology generally doesn’t use box plots. To generate a box plot, you need to install and load the package “ggplot2” using the following code.

#code to install package "ggplot2"
install.packages("ggplot2")

#code to load package "ggplot2" into R
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

To create the box plot, use the code below.

#code to create a box plot of Foliose percent cover under high and low light
ggplot(DATA, aes(Light, Foliose, color = Light ))+ geom_boxplot() + theme_classic() +ylab ("Percent lichen cover")

You can do a lot with the ggplot function, including adding letters above each of the boxes if your groups are significantly different. The code below first calls your plot “p” and then adds in letters as “text” wherever you set the x and y coordinates (e.g., y = level on the y-axis where you want to place the letter).

#code to create the box plot as above with letters over the bars
p<-ggplot(DATA, aes(Light, Foliose, color = Light ))+ geom_boxplot() + theme_classic() +ylab ("Percent lichen cover")
p + annotate("text", x = 1, y = 28, label = "a") + 
  annotate("text", x = 2, y = 70, label = "b")

*Written July 2018. Modified from http://stats.pugetsound.edu/ecology/