Welcome to Part 6 of my Stats in R Tutorial series. If you missed the last tutorial, you can check it out at the following link. Last time we touched on our first inferential test, the z-test. Most will find that navigating to the t-test is fairly straightforward and familiar, and with a few tweaks, you will be up and running with t-tests in no time. By the end of this tutorial, you will be able to run, interpret, report, and plot one-sample t-tests in R, along with their non-parametric alternative.
There are actually a number of t-tests that have been devised over the years. They include the one-sample t-test, the independent-samples (two-sample) t-test, and the paired-samples t-test.
We will focus on just two tests in this tutorial, parametric and non-parametric one-sample t-tests. Having a good understanding of these two tests will go a long way to understanding t-tests in general.
You may remember that the z-test in the last tutorial was simply a comparison of means between a population of interest and a representative sample from that population. However, z-tests are generally unrealistic to use in most practical applications because they require knowledge of population values. Rarely does one have, for example, the mean IQ of every banker in the country. Instead, we often only have sample-level statistics and can only use these to draw inferences about a theoretical population. This requires some adjustments to our previous z-test.
Recall that the z-statistic is derived with the following formula:
\[ z = \frac{\bar{x}-\mu}{\sigma_m} \]
where \(\bar{x}\) is a sample mean, \(\mu\) is a population mean, and \(\sigma_m\) is the standard error. If we know the population mean but don’t know the population variance, we can use a Student t-test instead. This simply involves modifying the standard error term to include sample variance instead of population variance. The formula for standard error of a population, \(\sigma_m\), is defined below (as explained in the last tutorial):
\[ \sigma_m = \frac{\sigma}{\sqrt{n}} \]
Now we just change the notation to reflect that it is instead sample variance we are using for our standard error term (and thus change our notation from \(\sigma\) to \(s\)):
\[ s_m = \frac{s}{\sqrt{n}} \]
Now we just replace the standard error term in our z-test formula to obtain the t-statistic we desire:
\[ t = \frac{\bar{x}-\mu}{s_m} \]
Just like last time, we are simply subtracting a population mean from a sample mean, then dividing by the standard error. The only differences here are the sample-derived standard error and the distribution with which we will look up our associated percentile. Otherwise, the procedure is almost exactly the same as when we conducted our z-test.
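Before moving to the data, it may help to see why the lookup distribution matters. The sketch below uses base R to overlay the standard normal curve (used by the z-test) on a t distribution; the df = 5 value here is just an arbitrary illustration. The t distribution's heavier tails compensate for the extra uncertainty that comes from estimating the variance from the sample.
#### Compare Normal and t Distributions ####
# the t distribution has heavier tails than the normal,
# so a larger statistic is needed to reach the same percentile
curve(dnorm(x), from = -4, to = 4,
      xlab = "Statistic", ylab = "Density")
curve(dt(x, df = 5), from = -4, to = 4,
      add = TRUE, lty = 2)
legend("topright",
       legend = c("Standard normal", "t (df = 5)"),
       lty = c(1, 2))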
To illustrate, let's use some data native to R. This time, we will use the Orange dataset, which contains circumference measurements of several orange trees at different ages. We will be using the tidyverse again, so as a preliminary step, we will load that as well. Finally, we will save Orange as a tibble named orange for tidier data, then inspect the data with the glimpse function.
#### Load Packages and Data ####
library(tidyverse)
orange <- as_tibble(Orange)
glimpse(orange)
## Rows: 35
## Columns: 3
## $ Tree <ord> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ age <dbl> 118, 484, 664, 1004, 1231, 1372, 1582, 118, 484, 664, 10…
## $ circumference <dbl> 30, 58, 87, 115, 120, 142, 145, 33, 69, 111, 156, 172, 2…
We can see our dataset only has a few variables. We are really only concerned with the circumference of the trees, particularly how much they vary from all orange trees in general. Let’s check the mean and standard deviation of the circumference values (along with our sample size), then plot our distribution of values for all the trees in this sample.
#### Inspect Mean and SD of Circumference ####
orange.stats <- orange %>%
  summarise(Mean = mean(circumference),
            SD = sd(circumference),
            N = nrow(orange))
orange.stats
## # A tibble: 1 × 3
## Mean SD N
## <dbl> <dbl> <int>
## 1 116. 57.5 35
#### Plot Mean and Distribution ####
orange %>%
  ggplot(aes(x=circumference))+
  geom_histogram(fill="white",
                 color="black")+
  geom_vline(xintercept = orange.stats$Mean,
             color = "red")+
  labs(x="Circumference",
       y="Count",
       title="Circumference Distribution of Orange Tree Sample")
Our mean value for orange tree circumference is around 116 mm and our standard deviation is around 57 mm. Additionally, our distribution here isn't particularly normal, something we will revisit later. For now, we will simply run our t-test as prescribed above.
Perhaps we have reason to believe that these orange trees are special (due to the addition of fertilizers, weather conditions, etc.). We must first define our null hypothesis \(H_0\) and alternative hypothesis \(H_1\) before running our t-test so it is clear what we are testing. Since we believe that our special trees are not the same as ordinary orange trees, our two competing hypotheses can be stated as:
\[ H_0: \text{There is no difference between our orange trees and orange trees in general.} \] \[ H_1: \text{There is a difference between our orange trees and orange trees in general.} \]
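One common way to write these hypotheses symbolically, letting \(\mu_{\text{trees}}\) denote the true mean circumference of our special trees and \(\mu\) the mean for orange trees in general:
\[ H_0: \mu_{\text{trees}} = \mu \qquad\qquad H_1: \mu_{\text{trees}} \neq \mu \]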
Now that we have set up our hypotheses, we can determine what the average circumference is for orange trees and then plug our sample data into the t-test formula from earlier. Let's say the average orange tree circumference is 95 mm. We already have the following data: a sample mean of \(\bar{x} \approx 116\), a sample standard deviation of \(s \approx 57\), and a sample size of \(n = 35\).
With this information, we can directly calculate our t-statistic. First, we calculate our standard error.
\[ s_m = \frac{s}{\sqrt{n}} = \frac{57}{\sqrt{35}} = 9.635 \] Then we just add this information (along with our other sample and population data) into our t-statistic formula.
\[ t = \frac{\bar{x}-\mu}{s_m} = \frac{116-95}{9.635} = 2.180 \] Of course, you can do this fairly quickly in R with the following code. Because R uses the unrounded sample statistics, its t-statistic will differ slightly from our rounded hand calculation:
#### Hand Craft Inputs ####
mu <- 95 # population mean
sd <- orange.stats$SD # note: masks the base sd() function
n <- orange.stats$N
m <- orange.stats$Mean
#### Calculate Standard Error ####
se <- sd/sqrt(n)
#### Calculate t-Statistic ####
t <- (m-mu)/se
t
## [1] 2.146398
From here, we still do not know how to determine if this difference is statistically significant. As with the z-test, we need the percentile (critical value) associated with a statistically significant effect; unlike the z-test, we also need the degrees of freedom associated with our sample. If our alpha is set to \(\alpha = .05\), then we simply look up the critical t-value matching our degrees of freedom at that alpha. Since we have not set a specific direction for our alternative hypothesis, we will use a two-tail test to determine if our t-statistic is significant. If our hypothesis had directionality (we expect our sample mean to be greater/lower than the population mean), then we would instead use a one-tail test.
The degrees of freedom for our t-test here are determined by calculating \(n-1\). Therefore, our degrees of freedom for this particular test are \(34\). Looking at the t-table here, we can see that we only have associated t-statistics for \(df = 30\) and \(df = 40\). However, we can see that even with \(df = 40\), our t-statistic of \(2.1464\) surpasses the cutoff needed to be statistically significant. Therefore, we can reject the null hypothesis (that there is no difference between our sample and population means) because the probability of observing a t-statistic this extreme under the null hypothesis is low. To get a more exact p-value for our t-statistic, we can use the pt function, which returns the cumulative probability for a given t-value and degrees of freedom.
p <- 2*pt(2.1464,
          34,
          lower.tail=F)
p
## [1] 0.03906306
As a quick breakdown: we feed the pt function a t-statistic and our degrees of freedom as the first two arguments. Then we set the lower.tail argument to F because we want the probability of exceeding our t-statistic. Finally, because we are using a two-tail test, we account for both tails of the distribution by simply multiplying by two. The result is \(p = .039\).
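As an aside, the qt function does the reverse of pt: given a probability and degrees of freedom, it returns the corresponding t-value. This is a handy way to find the exact critical cutoff directly instead of consulting a t-table:
#### Critical Cutoff via qt ####
# two-tail cutoff at alpha = .05 with df = 34
qt(1 - .05/2, df = 34) # approx. 2.03; our t of 2.1464 exceeds it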
Of course, this entire process can be sped up in R, which includes a native function called t.test that simplifies things. However, that does not mean you should apply this function without thinking about what R is actually doing with your data. R functions have a number of default arguments which must be considered when using them, and it is important to know what is going on under the hood. Thankfully, our specific application of the t-test is easy and straightforward with the t.test function. We simply supply the vector of values (our sample) to the x argument and use our previous population mean \(\mu\) for the mu argument. Since we saved mu as an object earlier, I will simply use it here, but you can just as easily write mu = 95 explicitly.
t.test(
  x = orange$circumference,
  mu = mu
)
##
## One Sample t-test
##
## data: orange$circumference
## t = 2.1464, df = 34, p-value = 0.03906
## alternative hypothesis: true mean is not equal to 95
## 95 percent confidence interval:
## 96.10926 135.60502
## sample estimates:
## mean of x
## 115.8571
R will first report which t-test you have run, which is a partial clue that you haven't completely botched your test in some way. Here it has correctly specified our t-test as a one-sample t-test. It then reports which data were used, our orange$circumference vector. It then provides the summary information used to derive significance: our t-statistic, our degrees of freedom, and the p-value associated with the two. Here we can see that they all match our previous calculations, and our test is significant like before. It also provides a 95% confidence interval and the sample mean.
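Two quick notes on t.test before moving on. First, if our hypothesis had been directional, we could set the alternative argument to "greater" or "less" instead of the default two-sided test. Second, t.test returns a list, so we can store the result and pull out individual components for reporting. A minimal sketch of both:
#### One-Tail Test and Extracting Results ####
# directional test: is the sample mean greater than mu?
t.test(x = orange$circumference,
       mu = mu,
       alternative = "greater")
#### Store and Extract Components ####
fit <- t.test(x = orange$circumference, mu = mu)
fit$statistic # t-statistic
fit$p.value   # p-value
fit$conf.int  # 95% confidence interval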
It is important to also report the effect size for the t-test, as statistical significance says nothing about the magnitude of the effect. Cohen’s \(d\) is defined by the following formula:
\[ d = \frac{\bar{x}-\mu}{s} \] Again, we already have these values, which makes their calculation straightforward.
d <- (m-mu)/sd
d
## [1] 0.3628075
Our corresponding Cohen’s \(d\)
value is \(.36\), which is considered a
small effect. So while our effect is statistically significant, the
effect is fairly small. One can also quickly obtain Cohen’s \(d\) with the effectsize
package.
library(effectsize)
cohens_d(x = orange$circumference,
         mu = 95)
## Cohen's d | 95% CI
## ------------------------
## 0.36 | [0.02, 0.70]
##
## - Deviation from a difference of 95.
As expected, the values match.
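If you want a quick verbal label for the magnitude, recent versions of the effectsize package also include an interpret_cohens_d helper, which applies Cohen's (1988) conventions by default (a minimal sketch; check your package version):
#### Interpret the Effect Size ####
# applies Cohen's (1988) cutoffs by default
interpret_cohens_d(0.36)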
Because we are not comparing against any other sample or repeated measures, we can simply report the results of the t-test in APA format and interpret what those results mean. At minimum, one can show a boxplot of the values in the sample along with the associated metrics from the t-test using ggplot.
I provide one example below. Most of this code should be familiar from previous tutorials. We plug our t-test values into the subtitle argument of labs here to include the test values at the top of our boxplot. Because this isn't a grouped boxplot, we can also remove the x-axis by supplying element_blank to the axis theme arguments below. By using element_text, we can also change the font to bold or italics using the face argument.
orange %>%
  ggplot(aes(y=circumference))+
  geom_boxplot(fill = "steelblue",
               alpha = .4,
               linewidth = 1)+
  labs(y="Circumference (mm)",
       title = "Circumference of Orange Trees",
       subtitle = "t = 2.1464, df = 34, p = 0.039, d = 0.36")+
  theme(plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
We can customize our plot a bit more by adding some jittered data points with geom_jitter and showing the population mean with geom_hline, which draws a horizontal line at any value we specify.
orange %>%
  ggplot(aes(y=circumference))+
  geom_hline(color = "darkred",
             alpha = .4,
             linewidth = 2,
             yintercept = 95)+
  geom_boxplot(fill = "steelblue",
               alpha = .4,
               linewidth = 1)+
  labs(y="Circumference (mm)",
       caption = "Red line represents population mean for comparison.",
       title = "Circumference of Orange Trees",
       subtitle = "t = 2.1464, df = 34, p = 0.039, d = 0.36")+
  theme(plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic"),
        plot.caption = element_text(face = "italic"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank())+
  geom_jitter(aes(x=0,
                  y=circumference),
              width = .03)
There are a couple of assumptions that must be met before performing a one-sample t-test: the observations must be independent of one another, and the variable must be approximately normally distributed.
As noted earlier, the distribution appeared abnormal: many of the bins in the histogram appeared isolated from each other, and the majority of data points fell on the left side of the distribution. We also had a fairly low sample size of \(n = 35\), so departures from normality may influence our test more than they would with larger samples. To quickly visualize whether the data are Gaussian distributed, we can use a QQ plot, which plots the quantiles of our observed data against the theoretical quantiles of a normal distribution. Nonlinear patterns in the QQ plot typically indicate departures from normality. Basically, if the data points in the plot mostly follow a straight line, we can assume our data is normally distributed.
To check this, we can use the qqnorm and qqline functions in R. The former draws the QQ plot, while the latter adds a reference line that the points should follow if the data are normal.
qqnorm(orange$circumference)
qqline(orange$circumference)
We can see that our QQ plot has some heavy tails (it is very curved at both ends) and there are some odd isolated patterns along the line (some points are tightly grouped together and separated from the rest). This is actually an indication that we may have also violated our independence assumption, because this behavior is typically present when there are relationships between data points; indeed, the Orange data consist of repeated measurements of the same five trees over time. We will ignore this for now, but know that when you see this, you should typically investigate why and consider different options depending on the nature of the dependence. For now, we know our data is not normal and will proceed with a non-parametric version of the t-test instead.
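As a complement to visual inspection, base R also includes the Shapiro-Wilk test, a formal test of the null hypothesis that the data come from a normal distribution; a small p-value suggests a departure from normality. Keep in mind that such tests can be underpowered in small samples and overly sensitive in large ones, so they should supplement, not replace, the QQ plot.
#### Formal Normality Check ####
# null hypothesis: the data are normally distributed
shapiro.test(orange$circumference)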
Non-parametric tests are those which do not assume a Gaussian (aka normal) distribution. In the case of the Wilcoxon signed-rank test, the normality assumption is avoided by working with the ranks of the data points rather than comparing means directly. Running a Wilcoxon test in R is similar to using the t.test function. Simply supply a vector and a population mean, then specify the direction of the test (we again use a two-tail test by assigning alternative = "two.sided").
wilcox.test(orange$circumference,
            mu = mu,
            alternative = "two.sided",
            exact = F)
##
## Wilcoxon signed rank test with continuity correction
##
## data: orange$circumference
## V = 432.5, p-value = 0.05527
## alternative hypothesis: true location is not equal to 95
Setting exact = F asks R for a normal approximation rather than an exact p-value. Without it, R warns that an exact p-value cannot be computed when there are ties in the ranks (more than one data point sharing the same value), as is the case with our data. Either way, the p-value is an approximation rather than a direct value obtained from tie-free data. In any case, our result now falls just short of significance (\(p = .055\)). Non-parametric tests are typically less powerful than standard t-tests when the t-test's assumptions hold, so this isn't all that surprising. However, it is still important to report that the sample mean was larger than the population mean in order to accurately reflect the sample's attributes.
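As with the t-test, it is good practice to pair the Wilcoxon test with an effect size. The effectsize package provides the rank-biserial correlation for this purpose (a minimal sketch, assuming a recent version of the package):
#### Effect Size for the Wilcoxon Test ####
# rank-biserial correlation: a non-parametric analogue
# of Cohen's d for the one-sample Wilcoxon test
rank_biserial(orange$circumference, mu = 95)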
Thank you for reading this tutorial, and I hope you have taken something from today’s material. If you felt that any part of this tutorial was helpful, and you would like to support me, please consider buying me a cup of coffee.
Check out my other RPubs if you want to learn more about stats in R.
We have covered a lot today. As mentioned previously, I advise playing around with this code on some other data to get in the habit of using t-tests in R. In the meantime, keep an eye out for new tutorials down the road, which will be made available either here or on my personal website. Take care and happy coding!