22 sep 2016
In probability and statistics, Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
In the English-language literature it takes its name from William Sealy Gosset's 1908 paper in Biometrika under the pseudonym "Student". Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples, for example the chemical properties of barley where sample sizes might be as low as 3.
Source: Wikipedia
layout(matrix(c(2:6,1,1,7:8,1,1,9:13), 4, 4))
n = 56 # Sample size
df = n - 1 # Degrees of freedom
mu = 100
sigma = 15
IQ = seq(mu-45, mu+45, 1)
par(mar=c(4,2,2,0))
plot(IQ, dnorm(IQ, mean = mu, sd = sigma), type='l', col="red", main = "Population Distribution")
n.samples = 12
for(i in 1:n.samples) {
par(mar=c(2,2,2,0))
hist(rnorm(n, mu, sigma), main="Sample Distribution", cex.axis=.5, col="red", cex.main = .75)
}
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]
So the t-statistic expresses the deviation of the sample mean \(\bar{x}\) from the population mean \(\mu\) in units of the standard error. The sample size enters through the standard error and through the degrees of freedom \(df = n - 1\).
Let's take one sample from our normal population and calculate the t-value.
x = rnorm(n, mu, sigma); x
##  [1]  84.65045 115.68970 115.56559 118.94881  89.41771  87.01272  98.71484
##  [8] 108.44501 107.60579 115.05398 105.53676 115.73904  88.15301 108.11327
## [15]  98.25847 101.20466  98.78177 103.56141 107.60687  85.57403  85.46381
## [22] 105.29740  96.39482 113.07688 112.61159  79.04625 108.83989 124.89027
## [29] 116.83447 130.73631  96.46831  77.63572 116.53323  96.38925 111.45964
## [36]  87.27848 118.08838 100.21642  99.94128 108.39469 108.77982 113.58376
## [43] 114.14878 120.79860  76.82810 113.91542  88.84646 102.49831 110.85571
## [50]  96.98566  85.15733 122.05964  86.80974  66.37649 120.68288 104.11393
hist(x, main = "Sample distribution", col = "red")
mean(x)
## [1] 103.0656
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]
t = (mean(x) - mu) / (sd(x) / sqrt(n)); t
## [1] 1.641664
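As a sanity check, the hand-computed t-value can be compared with the statistic reported by R's built-in `t.test()`. This is a minimal sketch: the seed and the names `t.manual` and `t.builtin` are assumptions made here for illustration and do not appear in the original analysis.

```r
set.seed(1)                      # hypothetical seed, only to make the sketch reproducible
n = 56
mu = 100
sigma = 15
x = rnorm(n, mu, sigma)

# t-value computed by hand, following the formula above
t.manual = (mean(x) - mu) / (sd(x) / sqrt(n))

# t-value reported by the built-in one-sample t-test
t.builtin = unname(t.test(x, mu = mu)$statistic)

# Both routes should agree up to numerical tolerance
all.equal(t.manual, t.builtin)
```

Storing the result under a name other than `t` also avoids masking R's matrix transpose function `t()`.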
Let's take more samples.
n.samples = 1000
mean.x.values = vector()
se.x.values = vector()
for(i in 1:n.samples) {
x = rnorm(n, mu, sigma)
mean.x.values[i] = mean(x)
se.x.values[i] = (sd(x) / sqrt(n))
}
head(cbind(mean.x.values, se.x.values))
##      mean.x.values se.x.values
## [1,]      98.49146    2.266740
## [2,]     101.05113    2.216187
## [3,]      99.18157    1.964371
## [4,]      99.24282    1.990593
## [5,]      99.47378    2.373609
## [6,]      99.43591    1.619040
hist(mean.x.values,
col = "red",
main = "Samples distribution",
xlab = "all sample means")
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]
t.values = (mean.x.values - mu) / se.x.values
tail(cbind(mean.x.values, mu, se.x.values, t.values))
##         mean.x.values  mu se.x.values   t.values
##  [995,]     101.69894 100    2.013191  0.8439018
##  [996,]      98.13298 100    2.305512 -0.8098086
##  [997,]     100.83925 100    2.415259  0.3474801
##  [998,]      99.32766 100    1.904915 -0.3529493
##  [999,]      98.98374 100    1.759196 -0.5776872
## [1000,]      98.95187 100    2.062893 -0.5080890
What is the distribution of all these t-values?
hist(t.values,
freq = FALSE,
main = "Sampled T-values",
xlab = "T-values",
ylim = c(0, .4))
t.grid = seq(-4, 4, .01) # grid of t-values; avoids overwriting T (shorthand for TRUE)
lines(t.grid, dt(t.grid, df), col = "red")
legend("topright", lty = 1, col="red", legend = "T-distribution")
So if the population is normally distributed (assumption of normality), the t-distribution describes how the standardised deviation of sample means from the population mean (\(\mu\)) is distributed for a given sample size (\(df = n - 1\)).
The t-distribution therefore differs for different sample sizes and converges to the standard normal distribution as the sample size grows.
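The convergence can be made concrete by measuring how far the t density is from the standard normal density for increasing degrees of freedom. The sketch below assumes an illustrative set of df values and a grid from -4 to 4; neither is taken from the original analysis.

```r
x.grid = seq(-4, 4, .01)
df.values = c(2, 5, 30, 1000)     # illustrative choices, not from the original

# Largest absolute gap between the t density and the standard normal density
max.gap = sapply(df.values, function(d) {
  max(abs(dt(x.grid, d) - dnorm(x.grid)))
})
names(max.gap) = df.values
max.gap
```

The gap shrinks as the degrees of freedom increase, which is why the normal distribution is a reasonable stand-in for large samples.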
The probability density function of the t-distribution is defined by:
\[\textstyle\frac{\Gamma \left(\frac{\nu+1}{2} \right)} {\sqrt{\nu\pi}\,\Gamma \left(\frac{\nu}{2} \right)} \left(1+\frac{x^2}{\nu} \right)^{-\frac{\nu+1}{2}}\!\]
where \(\nu\) is the number of degrees of freedom and \(\Gamma\) is the gamma function.
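To see that this formula is what R's `dt()` evaluates, the density can be written out by hand and compared on a grid. This is a minimal sketch; the function name `t.density` is hypothetical, and \(\nu = 55\) is chosen to match the degrees of freedom used in the simulation.

```r
# Hand-written density, a direct transcription of the formula above
t.density = function(x, nu) {
  gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) *
    (1 + x^2 / nu)^(-(nu + 1) / 2)
}

x.grid = seq(-4, 4, .01)
nu = 55

# Should agree with the built-in density up to numerical tolerance
all.equal(t.density(x.grid, nu), dt(x.grid, nu))
```

Note that `gamma()` overflows for several hundred degrees of freedom, so this direct transcription is only suitable for moderate \(\nu\); working on the log scale with `lgamma()` would be one way around that.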
Source: Wikipedia
Two sided
One sided
The effect-size is the standardised difference between the mean and the expected \(\mu\). In the t-test effect-size is expressed as \(r\).
\[r = \sqrt{\frac{t^2}{t^2 + \text{df}}}\]
r = sqrt(t^2/(t^2 + df))
r
## [1] 0.2603778
We can also calculate effect-sizes for all our calculated t-values. Under the assumption of \(H_0\) the effect-size distribution looks like this.
r = sqrt(t.values^2/(t.values^2 + df))
tail(cbind(mean.x.values, mu, se.x.values, t.values, r))
##         mean.x.values  mu se.x.values   t.values          r
##  [995,]     101.69894 100    2.013191  0.8439018 0.11306205
##  [996,]      98.13298 100    2.305512 -0.8098086 0.10854935
##  [997,]     100.83925 100    2.415259  0.3474801 0.04680287
##  [998,]      99.32766 100    1.904915 -0.3529493 0.04753787
##  [999,]      98.98374 100    1.759196 -0.5776872 0.07766007
## [1000,]      98.95187 100    2.062893 -0.5080890 0.06835048
hist(r, main = "effect-size distribution", col = "red")
Cohen (1988)
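The reference to Cohen (1988) presumably points to his conventional benchmarks for \(r\): roughly 0.1 for a small, 0.3 for a medium and 0.5 for a large effect. Assuming those cut-offs, a small helper can label effect sizes. The function name `label.r` and the "negligible" category for values below 0.1 are assumptions made for this sketch.

```r
# Hypothetical helper: label effect sizes r with Cohen's conventional cut-offs
label.r = function(r) {
  cut(abs(r),
      breaks = c(0, .1, .3, .5, 1),
      labels = c("negligible", "small", "medium", "large"),
      right = FALSE,            # intervals are closed on the left: [0, .1), [.1, .3), ...
      include.lowest = TRUE)    # with right = FALSE this keeps r = 1 in the last interval
}

labels.example = label.r(c(0.065, 0.26, 0.45, 0.7))
labels.example
```

These thresholds are rules of thumb rather than strict boundaries, so the labels are best read as a rough orientation.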
We use the one-sample t-test to compare the sample mean \(\bar{x}\) to the population mean \(\mu\).
Let's take a different sample of n = 299 and calculate the mean of this sample.
# IQ.van.je.buur ("IQ of your neighbour") is a data vector loaded outside this document
mu = 120
n = length(IQ.van.je.buur)
x = IQ.van.je.buur
mean_x = mean(x, na.rm = TRUE); mean_x
## [1] 120.8893
sd_x = sd(x, na.rm = TRUE); sd_x
## [1] 13.61314
Does this mean differ significantly from the population mean \(\mu = 120\)?
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}} = \frac{120.89 - 120 }{13.61 / \sqrt{299}}\]
So the t-statistic represents the deviation of the sample mean \(\bar{x}\) from the population mean \(\mu\), considering the sample size.
t = (mean_x - mu) / (sd_x / sqrt(n)); t
## [1] 1.129553
To determine whether the sample mean differs significantly from the population mean, we first have to specify the type I error rate (\(\alpha\)) that we are willing to accept.
Finally we have to calculate our p-value for which we need the degrees of freedom \(df = n - 1\) to determine the shape of the t-distribution.
df = n - 1; df
## [1] 298
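With the degrees of freedom known, a chosen type I error rate translates into a critical t-value via the quantile function `qt()`. The sketch below assumes the conventional \(\alpha = 0.05\), which the text does not fix explicitly, and copies the t-value and degrees of freedom from the output above.

```r
alpha = 0.05        # conventional choice; an assumption, not stated in the text
df = 298            # degrees of freedom from the output above
t.obs = 1.129553    # observed t-value from the output above

# Critical values for a two-sided and a one-sided (upper tail) test
t.crit.two = qt(1 - alpha / 2, df)
t.crit.one = qt(1 - alpha, df)

# Does the observed t-value fall in the rejection region?
reject.two = abs(t.obs) > t.crit.two
reject.one = t.obs > t.crit.one
c(two.sided = reject.two, one.sided = reject.one)
```

Under these assumptions the observed t-value stays below both critical values, so the null hypothesis would not be rejected in either test.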
if(!"visualize" %in% installed.packages()) { install.packages("visualize") }
library("visualize")
visualize.t(t, df, section = "upper")
visualize.t(c(-t, t), df, section = "tails")
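The shaded areas in these plots correspond to p-values that can also be computed directly with the distribution function `pt()`. This is a minimal sketch using the t-value and degrees of freedom reported above; the variable names are chosen here for illustration.

```r
t.obs = 1.129553    # observed t-value from the output above
df = 298            # degrees of freedom from the output above

# One-sided p-value: area in the upper tail beyond the observed t-value
p.one.sided = pt(t.obs, df, lower.tail = FALSE)

# Two-sided p-value: area in both tails beyond |t|
p.two.sided = 2 * pt(abs(t.obs), df, lower.tail = FALSE)

c(one.sided = p.one.sided, two.sided = p.two.sided)
```

Because the t-distribution is symmetric, the two-sided p-value is exactly twice the one-sided one.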
\[r = \sqrt{\frac{t^2}{t^2 + \text{df}}}\]
r = sqrt(t^2/(t^2 + df))
r
## [1] 0.06529365