Welcome to Part 6 of my Stats in R Tutorial series. If you missed the last tutorial, you can check it out at the following link. Last time we touched on our first inferential test, the z-test. Most will find that navigating to the t-test is fairly straightforward and familiar, and with a few tweaks, you will be up and running with t-tests in no time. By the end of this tutorial, you will be able to run, interpret, report, and plot one-sample t-tests in R, along with their non-parametric alternative.
There are actually a number of t-tests that have been devised over the years. They include the one-sample t-test, the independent-samples (two-sample) t-test, and the paired-samples t-test.
We will focus on just two tests in this tutorial, parametric and non-parametric one-sample t-tests. Having a good understanding of these two tests will go a long way to understanding t-tests in general.
You may remember that the z-test in the last tutorial was simply a comparison of means between a population of interest and a representative sample from that population. However, z-tests are generally unrealistic to use in most practical applications because they require knowledge of population values. Rarely does one have, for example, the mean IQ of every banker in the country. Instead, we often only have sample-level statistics and can only use these to draw inferences about a theoretical population. This requires some adjustments to our previous z-test.
Recall that the z-statistic is derived with the following formula:
\[ z = \frac{\bar{x}-\mu}{\sigma_m} \]
where \(\bar{x}\) is a sample mean, \(\mu\) is a population mean, and \(\sigma_m\) is the standard error. If we know the population mean but don’t know the population variance, we can use a Student t-test instead. This simply involves modifying the standard error term to include sample variance instead of population variance. The formula for standard error of a population, \(\sigma_m\), is defined below (as explained in the last tutorial):
\[ \sigma_m = \frac{\sigma}{\sqrt{n}} \]
Now we just change the notation to reflect that it is instead sample variance we are using for our standard error term (and thus change our notation from \(\sigma\) to \(s\)):
\[ s_m = \frac{s}{\sqrt{n}} \]
Now we just replace the standard error term in our z-test formula to obtain the t-statistic we desire:
\[ t = \frac{\bar{x}-\mu}{s_m} \]
Just like last time, we are simply subtracting a population mean from a sample mean, then dividing by the standard error. The only differences here are the sample-derived standard error and the distribution with which we will look up our associated percentile. Otherwise, the procedure is almost exactly the same as when we conducted our z-test.
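Before moving to the data, it may help to see why the lookup distribution matters. The sketch below uses base R to overlay the standard normal curve (used by the z-test) on a t distribution; the df = 5 value here is just an arbitrary illustration. The t distribution's heavier tails compensate for the extra uncertainty that comes from estimating the variance from the sample.
#### Compare Normal and t Distributions ####
# the t distribution has heavier tails than the normal,
# so a larger statistic is needed to reach the same percentile
curve(dnorm(x), from = -4, to = 4,
      xlab = "Statistic", ylab = "Density")
curve(dt(x, df = 5), from = -4, to = 4,
      add = TRUE, lty = 2)
legend("topright",
       legend = c("Standard normal", "t (df = 5)"),
       lty = c(1, 2))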
To illustrate, let's use some data native to R. This time, we will use the Orange dataset, which contains circumference measurements of several orange trees at different ages. We will be using the tidyverse again, so as a preliminary step, we will load that as well. Finally, we will save Orange as a tibble named orange for tidier data, then inspect the data with the glimpse function.
#### Load Packages and Data ####
library(tidyverse)
orange <- as_tibble(Orange)
glimpse(orange)
## Rows: 35
## Columns: 3
## $ Tree <ord> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ age <dbl> 118, 484, 664, 1004, 1231, 1372, 1582, 118, 484, 664, 10…
## $ circumference <dbl> 30, 58, 87, 115, 120, 142, 145, 33, 69, 111, 156, 172, 2…
We can see our dataset only has a few variables. We are really only concerned with the circumference of the trees, particularly how much they vary from all orange trees in general. Let’s check the mean and standard deviation of the circumference values (along with our sample size), then plot our distribution of values for all the trees in this sample.
#### Inspect Mean and SD of Circumference ####
orange.stats <- orange %>%
  summarise(Mean = mean(circumference),
            SD = sd(circumference),
            N = nrow(orange))
orange.stats
## # A tibble: 1 × 3
## Mean SD N
## <dbl> <dbl> <int>
## 1 116. 57.5 35
#### Plot Mean and Distribution ####
orange %>%
  ggplot(aes(x=circumference))+
  geom_histogram(fill="white",
                 color="black")+
  geom_vline(xintercept = orange.stats$Mean,
             color = "red")+
  labs(x="Circumference",
       y="Count",
       title="Circumference Distribution of Orange Tree Sample")
Our mean value for orange tree circumference is around 116 mm and our standard deviation is around 57 mm. Additionally, our distribution here isn't particularly normal, something we will revisit later. For now, we will simply run our t-test as prescribed above.
Perhaps we have reason to believe that these orange trees are special (due to the addition of fertilizers, weather conditions, etc.). We must first define our null hypothesis \(H_0\) and alternative hypothesis \(H_1\) before running our t-test so it is clear what we are testing. Since we believe that our special trees are not the same as ordinary orange trees, our two competing hypotheses can be stated as:
\[ H_0: \text{There is no difference between our orange trees and orange trees in general.} \] \[ H_1: \text{There is a difference between our orange trees and orange trees in general.} \]
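One common way to write these hypotheses symbolically, letting \(\mu_{\text{trees}}\) denote the true mean circumference of our special trees and \(\mu\) the mean for orange trees in general:
\[ H_0: \mu_{\text{trees}} = \mu \qquad\qquad H_1: \mu_{\text{trees}} \neq \mu \]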
Now that we have set up our hypotheses, we can determine what the average circumference is for orange trees and then plug our sample data into the t-test formula from earlier. Let's say the average orange tree circumference is 95 mm. We already have the following data: a sample mean of \(\bar{x} \approx 116\), a sample standard deviation of \(s \approx 57\), and a sample size of \(n = 35\).
With this information, we can directly calculate our t-statistic. First, we calculate our standard error.
\[ s_m = \frac{s}{\sqrt{n}} = \frac{57}{\sqrt{35}} = 9.635 \] Then we just add this information (along with our other sample and population data) into our t-statistic formula.
\[ t = \frac{\bar{x}-\mu}{s_m} = \frac{116-95}{9.635} = 2.180 \] Of course, you can do this fairly quickly in R with the following code. Because R uses the unrounded sample statistics, its t-statistic will differ slightly from our rounded hand calculation:
#### Hand Craft Inputs ####
mu <- 95 # population mean
sd <- orange.stats$SD # note: masks the base sd() function
n <- orange.stats$N
m <- orange.stats$Mean
#### Calculate Standard Error ####
se <- sd/sqrt(n)
#### Calculate t-Statistic ####
t <- (m-mu)/se
t
## [1] 2.146398
From here, we still do not know how to determine if this difference is statistically significant. As with the z-test, we need the percentile (critical value) associated with a statistically significant effect; unlike the z-test, we also need the degrees of freedom associated with our sample. If our alpha is set to \(\alpha = .05\), then we simply look up the critical t-value matching our degrees of freedom at that alpha. Since we have not set a specific direction for our alternative hypothesis, we will use a two-tail test to determine if our t-statistic is significant. If our hypothesis had directionality (we expect our sample mean to be greater/lower than the population mean), then we would instead use a one-tail test.
The degrees of freedom for our t-test here are determined by calculating \(n-1\). Therefore, our degrees of freedom for this particular test are \(34\). Looking at the t-table here, we can see that we only have associated t-statistics for \(df = 30\) and \(df = 40\). However, we can see that even with \(df = 40\), our t-statistic of \(2.1464\) surpasses the cutoff needed to be statistically significant. Therefore, we can reject the null hypothesis (that there is no difference between our sample and population means) because the probability of observing a t-statistic this extreme under the null hypothesis is low. To get a more exact p-value for our t-statistic, we can use the pt function, which returns the cumulative probability for a given t-value and degrees of freedom.
p <- 2*pt(2.1464,
          34,
          lower.tail=F)
p
## [1] 0.03906306
As a quick breakdown: we feed the pt function a t-statistic and our degrees of freedom as the first two arguments. Then we set the lower.tail argument to F because we want the probability of exceeding our t-statistic. Finally, because we are using a two-tail test, we account for both tails of the distribution by simply multiplying by two. The result is \(p = .039\).
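As an aside, the qt function does the reverse of pt: given a probability and degrees of freedom, it returns the corresponding t-value. This is a handy way to find the exact critical cutoff directly instead of consulting a t-table:
#### Critical Cutoff via qt ####
# two-tail cutoff at alpha = .05 with df = 34
qt(1 - .05/2, df = 34) # approx. 2.03; our t of 2.1464 exceeds it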
Of course, this entire process can be sped up in R, which includes a native function called t.test that simplifies things. However, that does not mean you should apply this function without thinking about what R is actually doing with your data. R functions have a number of default arguments which must be considered when using them, and it is important to know what is going on under the hood. Thankfully, our specific application of the t-test is easy and straightforward with the t.test function. We simply supply the vector of values (our sample) to the x argument and use our previous population mean \(\mu\) for the mu argument. Since we saved mu as an object earlier, I will simply use it here, but you can just as easily write mu = 95 explicitly.
t.test(
  x = orange$circumference,
  mu = mu
)
##
## One Sample t-test
##
## data: orange$circumference
## t = 2.1464, df = 34, p-value = 0.03906
## alternative hypothesis: true mean is not equal to 95
## 95 percent confidence interval:
## 96.10926 135.60502
## sample estimates:
## mean of x
## 115.8571
R will first report which t-test you have run, which is a partial clue that you haven't completely botched your test in some way. Here it has correctly specified our t-test as a one-sample t-test. It then reports which data were used, our orange$circumference vector. It then provides the summary information used to derive significance: our t-statistic, our degrees of freedom, and the p-value associated with the two. Here we can see that they all match our previous calculations, and our test is significant like before. It also provides a 95% confidence interval and the sample mean.
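Two quick notes on t.test before moving on. First, if our hypothesis had been directional, we could set the alternative argument to "greater" or "less" instead of the default two-sided test. Second, t.test returns a list, so we can store the result and pull out individual components for reporting. A minimal sketch of both:
#### One-Tail Test and Extracting Results ####
# directional test: is the sample mean greater than mu?
t.test(x = orange$circumference,
       mu = mu,
       alternative = "greater")
#### Store and Extract Components ####
fit <- t.test(x = orange$circumference, mu = mu)
fit$statistic # t-statistic
fit$p.value   # p-value
fit$conf.int  # 95% confidence interval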
It is important to also report the effect size for the t-test, as statistical significance says nothing about the magnitude of the effect. Cohen’s \(d\) is defined by the following formula:
\[ d = \frac{\bar{x}-\mu}{s} \] Again, we already have these values, which makes their calculation straightforward.
d <- (m-mu)/sd
d
## [1] 0.3628075
Our corresponding Cohen’s \(d\)
value is \(.36\), which is considered a
small effect. So while our effect is statistically significant, the
effect is fairly small. One can also quickly obtain Cohen’s \(d\) with the effectsize
package.
library(effectsize)
cohens_d(x = orange$circumference,
         mu = 95)
## Cohen's d | 95% CI
## ------------------------
## 0.36 | [0.02, 0.70]
##
## - Deviation from a difference of 95.
As expected, the values match.
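If you want a quick verbal label for the magnitude, recent versions of the effectsize package also include an interpret_cohens_d helper, which applies Cohen's (1988) conventions by default (a minimal sketch; check your package version):
#### Interpret the Effect Size ####
# applies Cohen's (1988) cutoffs by default
interpret_cohens_d(0.36)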
Because we are not comparing against any other sample or repeated measures, we can simply report the results of the t-test in APA format and interpret what those results mean. At minimum, one can show a boxplot of the values in the sample along with the associated metrics from the t-test using ggplot.
I provide one example below. Most of this code should be familiar from previous tutorials. We plug our t-test values into the subtitle argument of labs here to include the test values at the top of our boxplot. Because this isn't a grouped boxplot, we can also remove the x-axis by supplying element_blank to the axis theme arguments below. By using element_text, we can also change the font to bold or italics using the face argument.
orange %>%
  ggplot(aes(y=circumference))+
  geom_boxplot(fill = "steelblue",
               alpha = .4,
               linewidth = 1)+
  labs(y="Circumference (mm)",
       title = "Circumference of Orange Trees",
       subtitle = "t = 2.1464, df = 34, p = 0.039, d = 0.36")+
  theme(plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
We can customize our plot a bit more by adding some jittered data points with geom_jitter and showing the population mean with geom_hline, which draws a horizontal line at any value we specify.
orange %>%
  ggplot(aes(y=circumference))+
  geom_hline(color = "darkred",
             alpha = .4,
             linewidth = 2,
             yintercept = 95)+
  geom_boxplot(fill = "steelblue",
               alpha = .4,
               linewidth = 1)+
  labs(y="Circumference (mm)",
       caption = "Red line represents population mean for comparison.",
       title = "Circumference of Orange Trees",
       subtitle = "t = 2.1464, df = 34, p = 0.039, d = 0.36")+
  theme(plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic"),
        plot.caption = element_text(face = "italic"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank())+
  geom_jitter(aes(x=0,
                  y=circumference),
              width = .03)
There are a couple of assumptions that must be met before performing a one-sample t-test: the observations must be independent of one another, and the variable must be approximately normally distributed.
As noted earlier, the distribution appeared abnormal: many of the bins in the histogram appeared isolated from each other, and the majority of data points fell on the left side of the distribution. We also had a fairly low sample size of \(n = 35\), so departures from normality may influence our test more than they would with larger samples. To quickly visualize whether the data are Gaussian distributed, we can use a QQ plot, which plots the quantiles of our observed data against the theoretical quantiles of a normal distribution. Nonlinear patterns in the QQ plot typically indicate departures from normality. Basically, if the data points in the plot mostly follow a straight line, we can assume our data is normally distributed.
To check this, we can use the qqnorm and qqline functions in R. The former draws the QQ plot, while the latter adds a reference line that the points should follow if the data are normal.
qqnorm(orange$circumference)
qqline(orange$circumference)
We can see that our QQ plot has some heavy tails (it is very curved at both ends) and there are some odd isolated patterns along the line (some points are tightly grouped together and separated from the rest). This is actually an indication that we may have also violated our independence assumption, because this behavior is typically present when there are relationships between data points; indeed, the Orange data consist of repeated measurements of the same five trees over time. We will ignore this for now, but know that when you see this, you should typically investigate why and consider different options depending on the nature of the dependence. For now, we know our data is not normal and will proceed with a non-parametric version of the t-test instead.
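As a complement to visual inspection, base R also includes the Shapiro-Wilk test, a formal test of the null hypothesis that the data come from a normal distribution; a small p-value suggests a departure from normality. Keep in mind that such tests can be underpowered in small samples and overly sensitive in large ones, so they should supplement, not replace, the QQ plot.
#### Formal Normality Check ####
# null hypothesis: the data are normally distributed
shapiro.test(orange$circumference)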
Non-parametric tests are those which do not assume a Gaussian (aka normal) distribution. In the case of the Wilcoxon signed-rank test, the normality assumption is avoided by working with the ranks of the data points rather than comparing means directly. Running a Wilcoxon test in R is similar to using the t.test function. Simply supply a vector and a population mean, then specify the direction of the test (we again use a two-tail test by assigning alternative = "two.sided").
wilcox.test(orange$circumference,
            mu = mu,
            alternative = "two.sided",
            exact = F)
##
## Wilcoxon signed rank test with continuity correction
##
## data: orange$circumference
## V = 432.5, p-value = 0.05527
## alternative hypothesis: true location is not equal to 95
Setting exact = F asks R for a normal approximation rather than an exact p-value. Without it, R warns that an exact p-value cannot be computed when there are ties in the ranks (more than one data point sharing the same value), as is the case with our data. Either way, the p-value is an approximation rather than a direct value obtained from tie-free data. In any case, our result now falls just short of significance (\(p = .055\)). Non-parametric tests are typically less powerful than standard t-tests when the t-test's assumptions hold, so this isn't all that surprising. However, it is still important to report that the sample mean was larger than the population mean in order to accurately reflect the sample's attributes.
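As with the t-test, it is good practice to pair the Wilcoxon test with an effect size. The effectsize package provides the rank-biserial correlation for this purpose (a minimal sketch, assuming a recent version of the package):
#### Effect Size for the Wilcoxon Test ####
# rank-biserial correlation: a non-parametric analogue
# of Cohen's d for the one-sample Wilcoxon test
rank_biserial(orange$circumference, mu = 95)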
Thank you for reading this tutorial, and I hope you have taken something from today’s material. If you felt that any part of this tutorial was helpful, and you would like to support me, please consider buying me a cup of coffee.
Check out my other RPubs if you want to learn more about stats in R.
We have covered a lot today. As mentioned previously, I advise playing around with this code on some other data to get in the habit of using t-tests in R. In the meantime, keep an eye out for new tutorials down the road, which will be made available either here or on my personal website. Take care and happy coding!