Hypothesis testing: z- and t-tests

Author

Shota Momma

So far, we have dealt exclusively with categorical data, where an outcome can be classified into discrete categories, such as success or failure or faces of a die with no inherent ordering. Here, we will introduce a widespread statistical test known as t-tests, used for analyzing non-categorical (continuous) data. Like before, the goal is not really to memorize the procedures for conducting various sub-types of t-tests, but to understand the underlying logic behind significance testing.

Before introducing t-tests, which are useful, I want to introduce a mostly useless test known as the one-sample z-test. The reason that I talk about a useless test is not because I want to torture you, but because it is useful for introducing concepts relevant for understanding t-tests, which are useful.

The one-sample z-test

Suppose you are interested in whether emotionally valant words are recognized faster than other words. So you measured the reaction time 10 emotionally valiant words in a lexical decision task, and took the average reaction time of each item.

set.seed(1)
rts <- rnorm(10, mean = 520, sd = 30)
mean(rts)

[1] 523.9661

The sample mean is roughly 523ms. Now, suppose you somehow knows that words in general should have the average reaction time of 550 ms. Suppose further you know the true standard deviation is 30ms (don’t ask me how we can know the true population standard deviation - this is almost always unknown, and that’s why z-test is mostly useless). Suppose further that we know, for whatever reasons, that the average reaction time of a word follows the normal distribution.

Say you want to know if emotionally valant words are recognized faster than words in general. This is when z-test is used. First, let’s define the null and alternative hypotheses:

H0: \(u_1 = u_0 = 550\)
H1: \(u_1 ≠ u_0 = 550\)

The null hypothesis says that the true mean of the reaction times of emotionally valant words is 550ms, the same as the general population of words The alternative hypothesis says that it is not.

Here, our test statistic is the sample mean, \(\bar{X}\). Remember that we assume that the mean reaction time of a word is normally distributed. Thus, a \(\bar{X}\) should be a random variable drawn from a normal distribution with the mean of \(u_0\), that is, 550ms. What about the standard deviation? The standard deviation of the sample means is known as Standard error of the mean (often abbreviated as SE), which can be computed by dividing a standard deviation of the general population by the square root of the sample size:

\[ SE(\bar{X}) = \frac{\sigma}{\sqrt{n}} \]

Thus, under the null hypothesis we can say that the sample means are a random variable drawn from the following normal distribution:

\[ \bar{X} = normal(u_0, SE(\bar{X})) \]

Now, we can standardize the sample mean. Standardizing is done by first subtracting a mean from a value and divide the resulting value by the standard deviation. Thus, the standardized version of the sample mean can be computed by subtracting the true mean from the sample mean and dividing the resulting value by the standard deviation of the sample mean, that is, the SE.

\[ z_{\bar{X}} = \frac{\bar{X} - \mu_0}{SE(\bar{X})} \]

Why do we standardize sample means? The reason is because the standardized sample mean is a random variable drawn from a very conveniently looking normal distribution:

\[ z_{\bar{X}} \sim normal(0, 1) \]

Thus, z value of 1 means that the observed sample mean is 1 standard error away from the population mean under the null distribution (normal(0,1)).

Now, let’s compute the z value of the sample mean given the assumption above.

sample_mean = mean(rts)
pop_mean = 550
pop_sd = 30
n = length(rts)
SE = pop_sd/sqrt(n)
z = (sample_mean-pop_mean)/SE
z

[1] -2.744216

The z value is roughly -2.744. Now, let’s say that we set the alpha to be 0.05. We can figure out the critical region given the alpha = 0.05, using qnorm() function, just like we could figure out the critical region for the other distributions. If we want to run a two-sided test, we want the probability mass of 0.025 in each side of the distribution:

abs(qnorm(0.025,0,1))

[1] 1.959964

This gives us roughly 1.96 (both positive and negative) as the critical value. Visually:

curve(dnorm(x, 0,1), col='black', from=-4,to=4)
abline(v = 1.98, col="red", lwd=3, lty=2)
abline(v = -1.98, col="red", lwd=3, lty=2)
text(-3, 0.2, "critical region <- ", col="red", cex = 1)
text(3, 0.2, "-> critical region ", col="red", cex = 1)

Because the absolute value of -2.744 is above 1.96, we reject H\(_0\). We conclude that emotionally valant words are responded faster in general.

What’s the exact p-value? Remember that p-values mean the probability of getting the results at least as extreme as the observed result assuming that the null hypothesis is true. Thus, we can calculate the p-value by summing up the probability density corresponding to the z-values that is more than 2.744 and less than -2.744 assuming normal(0,1). To do so, we can simply use the pnorm() function.

pnorm(2.744,0,1,lower.tail = FALSE)*2

[1] 0.006069554

This should match up with the p-value computed by using the pre-built function from the BSDA pacakge:

#install.packages(BSDA) - z-test function is not base-R function
library(BSDA)

Loading required package: lattice


Attaching package: 'BSDA'

The following object is masked from 'package:datasets':

    Orange

z.test(rts, mu=550, sigma.x = 30)


    One-sample z-Test

data:  rts
z = -2.7442, p-value = 0.006066
alternative hypothesis: true mean is not equal to 550
95 percent confidence interval:
 505.3722 542.5599
sample estimates:
mean of x 
 523.9661

the t-test

I keep telling you that z-test is useless, because one of the assumptions we have to make is that we know the true standard deviation of the population, which is usually, if not always, unknown. There is a more useful version of the z-test, known as the one-sample t-test, which is very much like z-test except that we don’t have to pretend to know the true standard deviation of the population. In one-sample t-test, our test statistic is t-values, and t-values can be computed using the following formula:

\[ t = \frac{\bar{X} - \mu}{\hat{\sigma}/\sqrt{N}} \] As you may notice, this equation is almost identical to the equation for z-values. The only difference is that the standard deviation(\(\sigma\)) is now an estimate based on the sample (and thus the notation changed to \(\hat{\sigma}\) as opposed to just \(\sigma\).

T-values follow the distribution known as the t-distribution, which has the only one parameter: degrees of freedom. Visually:

curve(dt(x,1), xlab = 't value', ylab = 'p', col = rgb(0, 0, 1, 0.8), from=-4,to=4, ylim = c(0,0.5))
curve(dt(x,10),xlab = 't value', ylab = 'p', col = rgb(1, 0, 0, 0.8), from=-4,to=4, add = TRUE)
curve(dt(x,100), xlab = 't value', ylab = 'p', col = rgb(0, 1, 0, 0.8), from=-4,to=4, add = TRUE)

text(2.5, 0.15, "df = 1", col=rgb(0, 0, 1, 1), cex = 1.5)
text(2.5, 0.2, "df = 10", col=rgb(1, 0, 0, 1), cex = 1.5)
text(2.5, 0.25, "df = 100", col=rgb(0, 1, 0, 1), cex = 1.5)

As you may notice, as the degrees of freedom gets larger, t-distribution approaches normal(0,1).

Let’s compute the t-value using the reaction time example above:

pop_mean = 550
sample_mean = mean(rts)
sd_est = sd(rts)
n = length(rts)
t = (sample_mean-pop_mean)/(sd_est/sqrt(n))
t

[1] -3.515584

The p-value can be computed by calculating the probability of getting the results at least as extreme as the observed one. Using the pt() function:

deg = length(rts) - 1
pt(t, deg)*2

[1] 0.006560794

This matches with the output of the built-in function: t.test().

t.test(rts, mu=550)


    One Sample t-test

data:  rts
t = -3.5156, df = 9, p-value = 0.006561
alternative hypothesis: true mean is not equal to 550
95 percent confidence interval:
 507.2142 540.7180
sample estimates:
mean of x 
 523.9661

Independent sample t-test (aka Student test)

Now, we are ready to talk about the independent sample t-test, which is also useful for us. Let’s say that we are again comparing the reaction times of emtionally valant words and other words. But this time, instead of assuming that the mean reaction time of other words is 550ms, we compare two samples, a sample of emotionally valant words and the sample of other words.

set.seed(1)
rt_emo <- rnorm(10,530,30)
rt_other <- rnorm(10,550,30)

Just like before, you can first define the null hypothesis. This time, the null hypothesis is that there is no reaction time difference between two types of words In a math equation:

H0: \(u1 = u2\)
H1: \(u1 ≠ u2\)

Now, how do we test the null hypothesis? Here, the relevant test statistics is known as t-values, which are defined as follows:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{SE} \]

As we discussed above, SE can be computed by dividing an estimated standard deviation (sample standard deviation) by the square root of N. However, we now have two samples, meaning that we have two N’s and two sample standard deviations. How do we have a single SE when there are two sample standard deviations? Well, we can take a weighted average of the two sample standard deviations to calculate the pooled standard deviation estimate, \(\hat{\sigma}_p\):

\[ \hat{\sigma}_p = \sqrt{\frac{w_1\hat{\sigma_1}^2 + w_2\hat{\sigma_2}^2}{w_1 + w_2}} \]

,where w1 is the sample size - 1 of the first group, and w2 is the sample size - 1 of the second group. This ugly-looking formula is just the squared root of the weighted sum of variances. If you manipulate this formula a bit and replace w1 and w2 with (\(N_1\) - 1) and (\(N_2\) - 1) respectively, you get the formula you will see in many statistics textbooks:

\[ \hat{\sigma}_p = \sqrt{\frac{(N_1 - 1)\hat{\sigma}_1^2 + (N_2 - 1)\hat{\sigma}_2^2}{(N_1 + N_2) - 2}} \]

Now, we can calculate SE using the following: \[ SE = \hat{\sigma}\sqrt{\frac{1}{N_1}+\frac{1}{N_2}} \] This can be plugged into the formula for t-value above.

Just to make this concrete, let’s compute the t-value of the following two samples, each generated from a normal distribution with different means (if you like concrete stuff, treat them as reaction times of two different kinds of words).

set.seed(1)
samp1 <- rnorm(100, mean = 500, sd = 50)
samp2 <- rnorm(100, mean = 525, sd = 50)

n_1 <- length(samp1)
n_2 <- length(samp2)
var_1 <- var(samp1)
var_2 <- var(samp2)
mean_1 <- mean(samp1)
mean_2 <- mean(samp2)

sigma_p <- sqrt(((n_1 - 1)*var_1 + (n_2-1)*var_2)/(n_1+n_2 - 2))
SE <- sigma_p*sqrt(1/n_1 + 1/n_2)
diff <- mean_1 - mean_2
t = diff/SE
t

[1] -2.690565

Now that we have the t-value, we can compute the p-value. This step is identical to the one we used in the one-sample t-test:

deg <- n_1 + n_2 - 2
pt(t,deg)*2

[1] 0.007742591

Let’s make sure this agrees with the output of the t.test() function in R:

t.test(samp1,samp2, var.equal = TRUE)


    Two Sample t-test

data:  samp1 and samp2
t = -2.6906, df = 198, p-value = 0.007743
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -30.612744  -4.717711
sample estimates:
mean of x mean of y 
 505.4444  523.1096

Paired sample t-test

The independent sample t-test is called independent sample t-test, because… well… the two samples are assumed to be independent. But often in psychology and linguistics, you want to compare the data from two conditions experienced by the same set of people (i.e., we want to run a within-subject study). That clearly violates the independence assumption. What should we do?

Well, actually, we actually can run an one-sample t-test on a difference between condition! So, for example, assume you run a lexical decision semantic priming experiment (two conditions, related vs. unrelated priming conditions), where each participant experienced the two conditions:

set.seed(1)
n = 50
effect <- rnorm(n,30,100)
baseline <- rnorm(n,800, 200)

df <- data.frame(
  Subject = c(1:n),
  Realted_rt = baseline - effect,
  Unrealted_rt = baseline
)

First, you can simply subtract the unrealted rts from realted rts within each participant:

df$Effect <- df$Realted_rt - df$Unrealted_rt

Then you run a simple one-sample t-test, under the null hypothesis that the priming effect is 0ms!

mean = mean(df$Effect)
mu = 0
sd_sample = sd(df$Effect)
SE = sd_sample/sqrt(length(df$Effect))

t = (mean-mu)/SE

Now we can compute the p-value:

pt(t, length(df$Effect) - 1, lower.tail = TRUE) + pt(-t, length(df$Effect) - 1, lower.tail = FALSE)

[1] 0.001325081

Let’s check if the results agree with the output of the pre-built function:

t.test(df$Effect, mu = 0)


    One Sample t-test

data:  df$Effect
t = -3.4058, df = 49, p-value = 0.001325
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -63.67278 -16.41687
sample estimates:
mean of x 
-40.04483

Viola!