Student’s t-test

One sample t-test

A one-sample t-test tests the null hypothesis that the population mean is not different from a specified value. This can be done using t.test() in R.

To learn how to perform a t-test in R, we will use another data set that describes the violent crime rates and concealed-carry laws for the 50 US states plus DC. This is a balanced panel of data on the 50 US states plus the District of Columbia (for a total of 51 ‘states’) by year for 1977–1999. Each observation is a given state in a given year, so there are 51 states times 23 years = 1173 observations. These data were provided by Prof. John Donohue of Stanford University and were used in his paper with Ian Ayres, “Shooting Down the ‘More Guns, Less Crime’ Hypothesis” (Stanford Law Review 2003).

At this stage, you should be able to read in data and subset part of a data set. For example, we can use a one-sample t-test to test whether the murder rate in Michigan differs significantly from a certain value, say 9 incidents per 10^5 people. The syntax is t.test(sample, mu=specified_value).

You can either execute the following code to download the data directly into your R session (it will be a data frame named “guns”).

library(RCurl)
guns=read.csv(text=getURL("https://raw.githubusercontent.com/OscarFHC/NRE538_GSRA/master/Labs/NRE538_Lab2/guns.csv"), sep=",", header=T, comment.char="#")

Or, you can download the file to your local machine first and read it in with the following code.

setwd("D:/Courses/UM/2016_WN/NRE538_GSRA/Labs/NRE538_Lab2")
  # set working directory
guns = read.table("guns.csv",header=T,fill=T,sep=",")
  # read in data

Let’s perform the t-test.

guns.MI = subset(guns, state=="Michigan")
t.test(guns.MI[,"murder"], mu=9)
## 
##  One Sample t-test
## 
## data:  guns.MI[, "murder"]
## t = 2.3794, df = 22, p-value = 0.02644
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
##   9.085973 10.253157
## sample estimates:
## mean of x 
##  9.669565

So, what do these statistical results tell you?
Let’s plot a histogram of the murder rate in Michigan to visualize the results.

hist(guns.MI[,"murder"], breaks=20)
abline(v=mean(guns.MI[,"murder"]), col="blue")
abline(v=9, col="dark green")
legend("topleft", legend=c("sample mean", "specified value=9"), text.col=c("blue", "dark green"))


Exercise

Perform a t-test on other variables for Michigan and see if they differ significantly from values that you specify yourself.


Two sample t-test

A two-sample t-test tests the null hypothesis that two population means do not differ from each other. Two-sample t-tests can be divided into paired and unpaired tests.
Remember what the difference between the two is?

Let’s compare the murder rates of Washington and Minnesota. The syntax is t.test(sample1, sample2, paired=TRUE or FALSE).

guns.WAS = subset(guns, state=="Washington")
guns.MIN = subset(guns, state=="Minnesota")
t.test(guns.WAS[,"murder"], guns.MIN[,"murder"], paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  guns.WAS[, "murder"] and guns.MIN[, "murder"]
## t = 12.104, df = 43.616, p-value = 1.546e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.753891 2.454804
## sample estimates:
## mean of x mean of y 
##  4.773913  2.669565
  • What do the results tell you?
  • Why do I use paired=FALSE here?

Let’s visualize the results by plotting the two distributions.

hist(guns.WAS[,"murder"], breaks=15, col="light blue", xlim=c(0, 7), ylim=c(0,5), xlab="murder rate", main="")
abline(v=mean(guns.WAS[,"murder"]), col="blue")
par(new=TRUE)
hist(guns.MIN[,"murder"], breaks=15, col="light green", xlim=c(0, 7), ylim=c(0,5), xlab="", main="")
abline(v=mean(guns.MIN[,"murder"]), col="dark green")
  legend("topright", legend = c("Washington mean", "Minnesota mean"),text.col=c("blue", "dark green"))

Revisit the assumptions of the t-test

Three critical assumptions of t-test are:
1. samples have equal variance
2. samples are normally distributed
3. each observation is sampled independently
Among the three, the third one is difficult to test, but the first two are testable. Let’s see if the murder rates of Washington and Minnesota meet the first two assumptions.

To test for normality, we can use shapiro.test(), and to test for equal variance, we can use var.test(), which is essentially an F-test.

shapiro.test(guns.WAS[,"murder"])
## 
##  Shapiro-Wilk normality test
## 
## data:  guns.WAS[, "murder"]
## W = 0.94148, p-value = 0.1933
shapiro.test(guns.MIN[,"murder"])
## 
##  Shapiro-Wilk normality test
## 
## data:  guns.MIN[, "murder"]
## W = 0.98013, p-value = 0.9083
var.test(guns.WAS[,"murder"], guns.MIN[,"murder"])
## 
##  F test to compare two variances
## 
## data:  guns.WAS[, "murder"] and guns.MIN[, "murder"]
## F = 1.2072, num df = 22, denom df = 22, p-value = 0.6626
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5119867 2.8464432
## sample estimates:
## ratio of variances 
##           1.207204

All assumptions are met, so the results from this unpaired t-test should be robust. But, what if the assumptions are violated?

standard error (SE) and confidence interval (CI)

Let me first introduce two important statistics.
- standard error (SE): the standard deviation of a statistic (e.g. the mean). This is very different from the standard deviation of the sample!
- confidence interval (CI): the interval within which a statistic is expected to be observed. For example, “the 95% confidence interval of a statistic is X to Y” means that we are 95% confident that the true value of the statistic lies between X and Y. The upper/lower limit of the 95% CI of a statistic is the mean of that statistic plus/minus 1.96 SE. Why?
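As a hint for the “Why?”: 1.96 is the 97.5th percentile of the standard normal distribution, which you can check directly in R:

qnorm(0.975)  # leaves 2.5% in each tail, so 95% in between
## [1] 1.959964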

Here I create a simple 30-measurement sample, sample A, to show you what the standard error and confidence interval look like and how to calculate them.

muA=4
sdA=1
set.seed(32)
A = rnorm(30, mean=muA, sd=sdA)
hist(A,breaks=30,col=rgb(1,0,0,0.5),xlim=c(0,10), ylim=c(0,10), xlab="Value")
box()

There are a couple of ways to calculate the standard error and confidence interval of the mean.

  1. Use the central limit theorem and assume the sample mean follows a normal distribution. In that case, the standard error of the mean can be calculated as \(\frac{SD}{\sqrt{n}}\), where SD is the standard deviation of the sample and n is the sample size.
## [1] "standard error calculated base on the central limit theorm = 0.131"
## [1] "95% confidence interval base on the central limit theorm : 3.947 - 4.459"
  2. Use a resampling technique. This method is great because it is applicable to all kinds of sample distributions. (A sketch of both calculations follows this list.)
## [1] "standard error calculated by resampling = 0.13"
## [1] "95% confidence interval by resampling : 3.934 - 4.444"

How do they look on a figure?

With these two statistics in mind, we can dig into the rationale of comparing the means of two samples in more detail.

Important! To test whether two population means are significantly different from each other at the 5% significance level (the \(\alpha\) value determined by us), we need to know whether the 95% confidence interval of the difference between the two population means covers 0. If it does not cover 0, then the two population means are significantly different at the 5% significance level. There are multiple ways to calculate this 95% confidence interval.

One sample comparison

Let’s test whether the mean of sample A is significantly different from 4.5 as an example. We already know that the 95% confidence interval for the mean of sample A (roughly 3.93 to 4.44) does not cover 4.5, so the mean of sample A is significantly different from 4.5 at the 5% significance level.

xs = seq(0, 15, by =0.01)
fA = dnorm(xs, mean=A.mean.CLT, sd=A.mean.SE.CLT)
plot(xs, fA, type="l", lwd=3, col="blue", xlim=c(2, 6), ylim=c(0, 3), 
     main="distribution of means", xlab="value", ylab="density")
abline(v=c(qnorm(0.025, mean=A.mean.CLT, sd=A.mean.SE.CLT), qnorm(0.975, mean=A.mean.CLT, sd=A.mean.SE.CLT)), col="blue", lty="dashed")
abline(v=4.5, col="dark green", lty="solid", lwd=3)

In addition, we can calculate the probability of observing a value that is “greater” than 4.5 in the distribution of the mean of sample A. This is the probability used to decide whether to reject the null hypothesis, which here is that the mean of sample A is not less than (i.e. equal to or greater than) 4.5. If this probability is less than 0.05, we reject the null hypothesis at the 5% significance level.

The probability of observing a value that is greater than 4.5 in the distribution of means of sample A is:

1-pnorm(4.5, mean=A.mean.resamp, sd=A.mean.SE.resamp)
## [1] 0.01177523
  • Note that this is a one-tailed comparison because I already know that the mean of sample A is less than 4.5. The null hypothesis is thus “the mean of sample A is not less than, i.e. equal to or greater than, 4.5”. If it were a two-tailed test, we would have to multiply the probability by 2, as we would be dealing with both the “greater” and “less” directions.

Two sample comparison

The idea of a two-sample comparison is the same as that of a one-sample comparison, except that we now calculate the 95% confidence interval of the difference between the two sample means and see whether this confidence interval covers 0.
We first create another sample, B, which also has 30 measurements, and plot the two histograms together.

muA=4
muB=4.5
sdA=1
sdB=1
set.seed(32)
A = rnorm(30, mean=muA, sd=sdA)
B = rnorm(30, mean=muB, sd=sdB)
hist(A,breaks=30,col=rgb(1,0,0,0.5),xlim=c(0,10), ylim=c(0,10), xlab="Value", main="histogram of sample A and B")
hist(B,breaks=30,col=rgb(0,1,0,0.5),add=TRUE)
box()

According to the central limit theorem, the difference between the two sample means should follow a normal distribution, so we can calculate the 95% confidence interval accordingly.

dif.mean=mean(A)-mean(B)
dif.SE=sqrt((var(A)+var(B))/length(A))
dif.CI=c(dif.mean-1.96*dif.SE, dif.mean+1.96*dif.SE)
print(paste0("standard error calculated base on the central limit theorm = ", dif.SE))
## [1] "standard error calculated base on the central limit theorm = 0.200231209755627"
print(paste0("95% confidence interval base on the central limit theorm : ", 
             dif.mean-1.96*dif.SE, " - ", dif.mean+1.96*dif.SE))
## [1] "95% confidence interval base on the central limit theorm : -0.50282247902019 - 0.282083863221867"

The distribution of differences of sample means, and the location of zero, should look like this:

xs = seq(-1, 1, by =0.01)
fA = dnorm(xs, mean=dif.mean, sd=dif.SE)
plot(xs, fA, type="l", lwd=3, col="blue", xlim=c(-1, 1), ylim=c(0, 3), 
     main="distribution of sample mean difference", xlab="value", ylab="density")
abline(v=c(qnorm(0.025, mean=dif.mean, sd=dif.SE), qnorm(0.975, mean=dif.mean, sd=dif.SE)), col="blue", lty="dashed")
abline(v=0, col="dark green", lty="solid", lwd=3)

legend("topright", legend = c("95% confidence interval", 
                              "0"), 
       col=c("blue", "dark green"),
       text.col=c("blue", "dark green"),
       lty=c("dashed", "solid"), bty="n")

The 95% confidence interval is -0.503 to 0.282, which covers 0. So, we cannot reject the null hypothesis, which is that the two sample means do not differ from each other at the 5% significance level.

Again, we calculate the probability of observing a value that is “greater or less” than 0 in the distribution of differences. When this probability is less than 0.05 (the \(\alpha\) value), we reject the null hypothesis.

2*(1-pnorm(0, mean=dif.mean, sd=dif.SE))
## [1] 0.5814902
  • Notice that I multiply by 2 here. Can you explain why?

The calculations used in the two examples above are based on the assumption that the samples are normally distributed and on the central limit theorem. Under the same assumption, “Student” developed another method to compare two samples without actually calculating the confidence interval of the mean — the Student’s t-test. In the two-sample Student’s t-test, the difference between the means of samples A and B, scaled by the pooled standard error of that difference, follows a t-distribution. The t-distribution largely resembles a normal distribution, especially when the sample size is sufficiently large; however, its exact shape is determined by the degrees of freedom, which depend on the sample sizes (for the pooled two-sample test, degrees of freedom = \(n_A + n_B - 2\)). Given the two sample means, variances, and sample sizes, we can calculate a t-value (the “t” reported in the output of t.test()). With the t-distribution and the t-value calculated from the data, we can compute the probability of observing a t-value more extreme than the one calculated from the data. This probability is the p-value: the smaller it is, the stronger the evidence against the null hypothesis that the two population means are equal, and when it falls below \(\alpha\) we reject that null hypothesis.
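To make this concrete, here is a short sketch that computes the pooled two-sample t statistic and its p-value by hand for samples A and B; this mirrors what t.test(A, B, var.equal = TRUE) does internally.

n = length(A); m = length(B)
# pooled standard error of the difference between the two sample means
sp = sqrt(((n-1)*var(A) + (m-1)*var(B))/(n+m-2) * (1/n + 1/m))
t.val = (mean(A) - mean(B))/sp    # the "t" reported by t.test()
df = n + m - 2                    # degrees of freedom of the pooled test
2*pt(-abs(t.val), df)             # two-tailed p-value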

Let’s perform a t-test on samples A and B and see if we derive the same conclusions.

t.test(A, B, paired = FALSE, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  A and B
## t = -0.55121, df = 58, p-value = 0.5836
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5111756  0.2904370
## sample estimates:
## mean of x mean of y 
##  4.203129  4.313499

Notice that the p-value = 0.5836 from the t-test is really close to the one we calculated, 0.5815. The conclusions are the same.


Exercise

  • Can you try to compute the one-sample case?

A couple of important things should be mentioned here…
  1. Note that the exact shape of the t-distribution depends on the sample size. With a larger sample size, the distribution becomes much narrower, which makes it easier for two populations to be declared significantly different from each other. This is partly why we often say that with more data points it becomes easier to find significant results; this is also the subject of power analysis. (See the sketch after this list.)

  2. What if we alter the significance level from 0.05 to 0.1 or 0.01? By doing so, we change the probabilities of Type I and Type II errors.

  3. When the samples are not normally distributed, these t statistics start to fall apart. That is, when the assumptions of the t-test are violated, we cannot rely on the t-distribution to compare the two samples.
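To visualize the first point, here is a short sketch that compares t-distributions with different degrees of freedom against the standard normal (the df values 2 and 30 are arbitrary choices for illustration):

xs = seq(-4, 4, by=0.01)
plot(xs, dnorm(xs), type="l", lwd=3, xlab="t", ylab="density",
     main="t-distributions vs. the standard normal")
lines(xs, dt(xs, df=2), col="blue", lwd=2)         # few df: heavier tails
lines(xs, dt(xs, df=30), col="dark green", lwd=2)  # many df: close to normal
legend("topright", legend=c("normal", "t, df=2", "t, df=30"),
       col=c("black", "blue", "dark green"), lty="solid", bty="n")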

non-equal variance

Let’s violate the first assumption, equal variance, and see what would happen…

In the following code, I first write a function to calculate the t statistic when comparing two samples. I also calculate the theoretical critical t-values at the 0.05 significance level for comparing two normally distributed samples with equal variance. I then run 100 simulations (nrep). In each simulation, I generate a pair of 10-measurement samples 10000 times (nsim), each pair drawn independently from two normal distributions: one with mean 0 and standard deviation 1, the other with mean 0 but standard deviation 10. Each time, I calculate and save the t-value comparing the two samples. Just by chance, the two samples might have different means (and so produce large t-values), even though each pair is generated from normal distributions with the same mean (so the null hypothesis is always true). In each simulation, I then count how many t-values exceed the theoretical critical values. If all assumptions were met, the probability of observing a t-value more extreme than the theoretical ones should be 0.05. This probability is the probability of making a type I error, which means we reject the null hypothesis when the null hypothesis is true.
Let’s see how unequal variance affects the probability of making a type I error.

t_stat = function(x, y) { # generates values of t-test statistic
  
  n = length(x)
  m = length(y)
  
  # compute pooled standard error   
  sp = sqrt(((n-1)*sd(x)^2 + (m-1)*sd(y)^2)/(n+m-2)*(n+m)/(n*m))
  
  # compute test statistic
  (mean(x)-mean(y))/sp
}

n = 10
m = 10
crit = c(qt(0.025, n+m-2), qt(0.975, n+m-2))

nsim = 10000 # number of repetitions for ONE Monte Carlo simulation,
              # produces ONE estimate of the probability of a type 1 error.
nrep = 100 # number of repetitions of Monte Carlo simulation of length nsim,
            # needed to compute several (nrep) estimates of the probability of type 1 error

val = matrix(NA, ncol=1, nrow=nsim)
prob = matrix(NA, ncol=1, nrow=nrep)

# Monte Carlo Simulation
set.seed(1032)

for (i in 1:nrep) {
  for (j in 1:nsim) {
    
    x = rnorm(n, 0, 1)
    y = rnorm(m, 0, 10)
    
    # values of t-test statistic
    val[j] = t_stat(x,y)
  }
  
  # probability of type 1 error
  prob[i] = length(val[val> crit[2] | val< crit[1]])/ length(val)
}
p = round(mean(prob),3)
se = round(sd(prob)/sqrt(nrep),7)
#print(paste0("probability of making type I error = ", p));
#print(paste0("Standard error of the probability of making type I error = ", se))

We see that the probability of making a type I error is 0.064 ± 2.488 × 10^{-4}. It increases from 5% to more than 6%.
What should we do then? There is another test, called Welch’s t-test, that deals with this issue, and it’s built into the t.test() function: just specify var.equal=FALSE when using t.test(). Or, the resampling technique is always your good friend! (See the sketch below.)
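As a quick check, here is a sketch that reruns a smaller version of the simulation above but lets t.test() apply the Welch correction (I use 1000 repetitions per simulation here instead of 10000, purely to keep the run time short):

set.seed(1032)
prob.welch = replicate(100, {
  pvals = replicate(1000, t.test(rnorm(10, 0, 1), rnorm(10, 0, 10),
                                 var.equal=FALSE)$p.value)
  mean(pvals < 0.05)  # rejection rate when the null hypothesis is true
})
round(mean(prob.welch), 3)  # should be close to the nominal 0.05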


Exercise

Use the code above to see what happens when the standard deviation of one sample is 100.


non-normal distribution

Let’s violate the second assumption, normality, with the same method and see what happens…

Let’s compare a normal distribution to a gamma distribution, which is typically a right-skewed distribution. Can you plot a histogram showing the gamma distribution? (A sketch follows below.)

Here, I set the mean of the gamma distribution to be the same as that of the normal distribution, so that the null hypothesis is still always true.
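A minimal sketch of that histogram (the number of draws, 1000, is an arbitrary choice for display purposes):

set.seed(1302)
g = rgamma(1000, shape=5, scale=1/5)  # mean = shape*scale = 1, matching rnorm(n, 1, 1)
hist(g, breaks=30, col="light green", xlab="Value",
     main="gamma distribution (shape=5, scale=1/5)")
abline(v=mean(g), col="dark green")   # the bulk sits left of the long right tail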

t_stat = function(x, y) { # generates values of t-test statistic
  
  n = length(x)
  m = length(y)
  
  # compute pooled standard error   
  sp = sqrt(((n-1)*sd(x)^2 + (m-1)*sd(y)^2)/(n+m-2)*(n+m)/(n*m))
  
  # compute test statistic
  (mean(x)-mean(y))/sp
}

n = 10
m = 10
crit = c(qt(0.025, n+m-2), qt(0.975, n+m-2))

nsim = 1000 # number of repetitions for ONE Monte Carlo simulation,
              # produces ONE estimate of the probability of a type 1 error.
nrep = 100 # number of repetitions of Monte Carlo simulation of length nsim,
            # needed to compute several (nrep) estimates of the probability of type 1 error

val = matrix(NA, ncol=1, nrow=nsim)
prob = matrix(NA, ncol=1, nrow=nrep)

# Monte Carlo Simulation
set.seed(1302)

for (i in 1:nrep) {
  for (j in 1:nsim) {
    
    x = rnorm(n, 1, 1)
    y = rgamma(m, shape=5, scale=1/5)

    # values of t-test statistic
    val[j] = t_stat(x,y)
  }
  
  # probability of type 1 error
  prob[i] = length(val[val> crit[2] | val< crit[1]])/ length(val)
}
p = round(mean(prob),3)
se = round(sd(prob)/sqrt(nrep),7)
p
## [1] 0.056
se
## [1] 0.0007364
#print(paste0("probability of making type I error = ", p));
#print(paste0("Standard error of the probability of making type I error = ", se))

We see that the probability of making a type I error becomes 0.056 ± 7.364 × 10^{-4}. Violating the normality assumption also increases the probability of making a Type I error.
Well, to my knowledge, t.test() does not have a decent way to deal with this issue… Maybe it’s time to really learn the resampling technique. (See the sketch below.)
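As one example of such a technique, here is a sketch of a simple permutation test; perm.test is a hypothetical helper written for illustration, not a built-in function. Under the null hypothesis the group labels are exchangeable, so we shuffle them many times and ask how often the shuffled mean difference is at least as extreme as the observed one.

perm.test = function(x, y, nperm=10000) {
  obs = mean(x) - mean(y)            # observed difference in means
  pooled = c(x, y)
  perm.diffs = replicate(nperm, {
    idx = sample(length(pooled), length(x))  # reshuffle the group labels
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  mean(abs(perm.diffs) >= abs(obs))  # two-tailed p-value
}
perm.test(guns.WAS[,"murder"], guns.MIN[,"murder"])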


Exercise

Use the code above to see how the probability changes when comparing a normal distribution to a uniform (runif()) distribution.