22 sep 2016

Contents

T-distribution

Gosset

In probability and statistics, Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.

In the English-language literature it takes its name from William Sealy Gosset's 1908 paper in Biometrika under the pseudonym "Student". Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples, for example the chemical properties of barley where sample sizes might be as low as 3.

Source: Wikipedia

Population distribution

# 4 x 4 plot grid: panel 1 spans the four centre cells, panels 2 to 13 surround it
layout(matrix(c(2:6,1,1,7:8,1,1,9:13), 4, 4))

n  = 56    # Sample size
df = n - 1 # Degrees of freedom

mu    = 100
sigma = 15

IQ = seq(mu-45, mu+45, 1)

par(mar=c(4,2,2,0))  
plot(IQ, dnorm(IQ, mean = mu, sd = sigma), type='l', col="red", main = "Population Distribution")

n.samples = 12

for(i in 1:n.samples) {
  
  par(mar=c(2,2,2,0))  
  hist(rnorm(n, mu, sigma), main="Sample Distribution", cex.axis=.5, col="red", cex.main = .75)
  
}

T-statistic

\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]

So the t-statistic expresses the deviation of the sample mean \(\bar{x}\) from the population mean \(\mu\) in units of the standard error. The sample size enters through the standard error and through the degrees of freedom \(df = n - 1\).

A sample

Let's take one sample from our normal population and calculate the t-value.

x = rnorm(n, mu, sigma); x
##  [1]  84.65045 115.68970 115.56559 118.94881  89.41771  87.01272  98.71484
##  [8] 108.44501 107.60579 115.05398 105.53676 115.73904  88.15301 108.11327
## [15]  98.25847 101.20466  98.78177 103.56141 107.60687  85.57403  85.46381
## [22] 105.29740  96.39482 113.07688 112.61159  79.04625 108.83989 124.89027
## [29] 116.83447 130.73631  96.46831  77.63572 116.53323  96.38925 111.45964
## [36]  87.27848 118.08838 100.21642  99.94128 108.39469 108.77982 113.58376
## [43] 114.14878 120.79860  76.82810 113.91542  88.84646 102.49831 110.85571
## [50]  96.98566  85.15733 122.05964  86.80974  66.37649 120.68288 104.11393
hist(x, main = "Sample distribution", col = "red")

mean(x)
## [1] 103.0656

t-value

\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]

t = (mean(x) - mu) / (sd(x) / sqrt(n)); t
## [1] 1.641664

More samples

Let's take more samples.

n.samples     = 1000
mean.x.values = numeric(n.samples)  # preallocate instead of growing in the loop
se.x.values   = numeric(n.samples)

for(i in 1:n.samples) {
  x = rnorm(n, mu, sigma)
  mean.x.values[i] = mean(x)
  se.x.values[i]   = (sd(x) / sqrt(n))
}

Mean and SE for all samples

head(cbind(mean.x.values, se.x.values))
##      mean.x.values se.x.values
## [1,]      98.49146    2.266740
## [2,]     101.05113    2.216187
## [3,]      99.18157    1.964371
## [4,]      99.24282    1.990593
## [5,]      99.47378    2.373609
## [6,]      99.43591    1.619040

Sampling distribution

hist(mean.x.values, 
     col  = "red", 
     main = "Sampling distribution of the mean", 
     xlab = "all sample means")
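The spread of these sample means is what the standard error estimates. A minimal sketch, assuming the same population parameters as above; the seed and the names `sample.means` and `theoretical.se` are illustrative and not part of the original slides:

```r
set.seed(1)  # assumed seed, only for reproducibility

n     = 56   # sample size, same value as above
mu    = 100
sigma = 15

# Draw 1000 samples and keep the mean of each one
sample.means = replicate(1000, mean(rnorm(n, mu, sigma)))

# Theoretical standard error of the mean: sigma / sqrt(n)
theoretical.se = sigma / sqrt(n)

# The standard deviation of the sample means should be close to it
c(empirical = sd(sample.means), theoretical = theoretical.se)
```

The per-sample values in `se.x.values` estimate this same quantity from one sample each, using \(s_x\) in place of \(\sigma\).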

Calculate t-values

\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]

t.values = (mean.x.values - mu) / se.x.values

tail(cbind(mean.x.values, mu, se.x.values, t.values))
##         mean.x.values  mu se.x.values   t.values
##  [995,]     101.69894 100    2.013191  0.8439018
##  [996,]      98.13298 100    2.305512 -0.8098086
##  [997,]     100.83925 100    2.415259  0.3474801
##  [998,]      99.32766 100    1.904915 -0.3529493
##  [999,]      98.98374 100    1.759196 -0.5776872
## [1000,]      98.95187 100    2.062893 -0.5080890

Sampled t-values

What is the distribution of all these t-values?

hist(t.values, 
     freq = FALSE, 
     main = "Sampled T-values", 
     xlab = "T-values", 
     ylim = c(0, .4))
# Use a name other than T for the grid: T is R's shorthand for TRUE
t.grid = seq(-4, 4, .01)
lines(t.grid, dt(t.grid, df), col = "red")
legend("topright", lty = 1, col="red", legend = "T-distribution")

T-distribution

So if the population is normally distributed (assumption of normality), the t-distribution describes how the standardised deviations of sample means from the population mean (\(\mu\)) are distributed, for a given sample size (\(df = n - 1\)).

The t-distribution is therefore different for different sample sizes and converges to the standard normal distribution as the sample size grows.
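The convergence can be made visible by comparing critical values. A small sketch, with an arbitrary selection of degrees of freedom chosen for illustration:

```r
# Upper 2.5% critical values of the t-distribution for growing df
dfs    = c(2, 5, 10, 30, 55, 100, 1000)
t.crit = qt(0.975, dfs)
z.crit = qnorm(0.975)   # standard normal counterpart

# The difference shrinks towards zero as df grows
round(cbind(df = dfs, t.crit, z.crit, difference = t.crit - z.crit), 3)
```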

The probability density function of the t-distribution is:

\[f(x) = \frac{\Gamma \left(\frac{\nu+1}{2} \right)} {\sqrt{\nu\pi}\,\Gamma \left(\frac{\nu}{2} \right)} \left(1+\frac{x^2}{\nu} \right)^{-\frac{\nu+1}{2}}\]

where \(\nu\) is the number of degrees of freedom and \(\Gamma\) is the gamma function.
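As a check, the formula can be written out directly and compared with R's built-in `dt()`. This is a sketch; `t.density` is a hypothetical helper name, and the grid and \(\nu = 55\) are chosen only for illustration:

```r
# Density of the t-distribution, written out term by term from the formula
t.density = function(x, nu) {
  gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) *
    (1 + x^2 / nu)^(-(nu + 1) / 2)
}

x.grid = seq(-4, 4, .5)
nu     = 55

# Largest absolute difference with dt(); should be practically zero
max(abs(t.density(x.grid, nu) - dt(x.grid, nu)))
```

For very large \(\nu\) the call to `gamma()` overflows, so in practice `dt()` is the safer choice.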

Source: Wikipedia

One or two sided

Two sided

  • \(H_A: \bar{x} \neq \mu\)

One sided

  • \(H_A: \bar{x} > \mu\)
  • \(H_A: \bar{x} < \mu\)
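Each alternative hypothesis corresponds to a different tail area of the t-distribution. A minimal sketch using `pt()`, with the t-value and degrees of freedom of the single sample shown earlier; the names `t.obs` and `df.obs` are illustrative:

```r
t.obs  = 1.641664  # t-value of the single sample shown earlier
df.obs = 55        # n - 1 with n = 56

p.upper = pt(t.obs, df.obs, lower.tail = FALSE)           # one sided: x-bar > mu
p.lower = pt(t.obs, df.obs, lower.tail = TRUE)            # one sided: x-bar < mu
p.two   = 2 * pt(abs(t.obs), df.obs, lower.tail = FALSE)  # two sided: x-bar != mu

c(upper = p.upper, lower = p.lower, two.sided = p.two)
```

For a positive t-value the two-sided p-value is twice the upper one-sided p-value.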

Effect-size

The effect-size is the standardised difference between the sample mean and the expected \(\mu\). In the t-test, effect-size is expressed as \(r\).

\[r = \sqrt{\frac{t^2}{t^2 + \text{df}}}\]

r = sqrt(t^2/(t^2 + df))

r
## [1] 0.2603778

Effect-size distribution

We can also calculate effect-sizes for all our calculated t-values. Under the assumption of \(H_0\) the effect-size distribution looks like this.

r = sqrt(t.values^2/(t.values^2 + df))

tail(cbind(mean.x.values, mu, se.x.values, t.values, r))
##         mean.x.values  mu se.x.values   t.values          r
##  [995,]     101.69894 100    2.013191  0.8439018 0.11306205
##  [996,]      98.13298 100    2.305512 -0.8098086 0.10854935
##  [997,]     100.83925 100    2.415259  0.3474801 0.04680287
##  [998,]      99.32766 100    1.904915 -0.3529493 0.04753787
##  [999,]      98.98374 100    1.759196 -0.5776872 0.07766007
## [1000,]      98.95187 100    2.062893 -0.5080890 0.06835048
hist(r, main = "effect-size distribution", col = "red")

Cohen (1988)

  • Small: \(r \approx .1\)
  • Medium: \(r \approx .3\)
  • Large: \(r \approx .5\)

One-sample t-test

Compare sample mean

We use the one-sample t-test to compare the sample mean \(\bar{x}\) to the population mean \(\mu\).

Let's take a different sample of n = 299 and calculate the mean of this sample.

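mu = 120
n  = length(IQ.van.je.buur)  # data vector ("IQ of your neighbour"), assumed to be loaded already
x  = IQ.van.je.buur

# Spell out TRUE: the shorthand T can be overwritten by a variable named T
mean_x = mean(x, na.rm = TRUE); mean_x
## [1] 120.8893
sd_x   = sd(x, na.rm = TRUE); sd_x
## [1] 13.61314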

Does this mean differ significantly from the population mean \(\mu = 120\)?

Hypothesis

Null hypothesis

  • \(H_0: \bar{x} = \mu\)

Alternative hypothesis

  • \(H_A: \bar{x} \neq \mu\)
  • \(H_A: \bar{x} > \mu\)
  • \(H_A: \bar{x} < \mu\)

T-statistic

\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}} = \frac{120.89 - 120 }{13.61 / \sqrt{299}}\]

So the t-statistic represents the deviation of the sample mean \(\bar{x}\) from the population mean \(\mu\), considering the sample size.

t = (mean_x - mu) / (sd_x / sqrt(n)); t
## [1] 1.129553

Type I error

To determine whether the sample mean differs significantly from the population mean, we have to specify the type I error rate that we are willing to accept.

  • Type I error / \(\alpha\) = .05
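What this error rate means can be checked by simulation: when \(H_0\) is true, about 5% of samples should still produce a significant t-value. A sketch under assumed settings; the population values mirror the earlier simulation, and the seed, the number of repetitions and all `.sim` names are illustrative:

```r
set.seed(2)  # assumed seed, only for reproducibility

n.sim     = 56
mu.sim    = 100
sigma.sim = 15
alpha     = .05
df.sim    = n.sim - 1

# t-value of one sample drawn while H0 is true
one.t = function() {
  x = rnorm(n.sim, mu.sim, sigma.sim)
  (mean(x) - mu.sim) / (sd(x) / sqrt(n.sim))
}
t.sim = replicate(5000, one.t())

# Two-sided critical value and the share of samples beyond it
t.crit.sim = qt(1 - alpha / 2, df.sim)
rate       = mean(abs(t.sim) > t.crit.sim)
rate  # should be close to alpha
```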

P-value one sided

Finally, we calculate the p-value. For this we need the degrees of freedom \(df = n - 1\), which determine the shape of the t-distribution.

df = n - 1; df
## [1] 298
if (!requireNamespace("visualize", quietly = TRUE)) { install.packages("visualize") }
library("visualize")

visualize.t(t, df, section = "upper")

P-value two sided

visualize.t(c(-t, t), df, section = "tails")
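The whole procedure can also be cross-checked against R's built-in `t.test()`. Because the original data are not included here, this sketch uses simulated stand-in data; `x.sim`, `mu.0`, the seed and the assumed standard deviation of 14 are illustrative and do not reproduce the values reported above:

```r
set.seed(3)  # assumed seed, only for reproducibility

# Stand-in data, NOT the original IQ.van.je.buur
x.sim = rnorm(299, mean = 120, sd = 14)
mu.0  = 120

# By hand: t-value and two-sided p-value
n.obs  = length(x.sim)
t.hand = (mean(x.sim) - mu.0) / (sd(x.sim) / sqrt(n.obs))
p.hand = 2 * pt(abs(t.hand), n.obs - 1, lower.tail = FALSE)

# Built-in one-sample t-test
fit = t.test(x.sim, mu = mu.0, alternative = "two.sided")

c(t.hand = t.hand, t.builtin = unname(fit$statistic))
c(p.hand = p.hand, p.builtin = fit$p.value)
```

Setting `alternative = "greater"` or `alternative = "less"` gives the one-sided versions.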

Effect-size

\[r = \sqrt{\frac{t^2}{t^2 + \text{df}}}\]

r = sqrt(t^2/(t^2 + df))

r
## [1] 0.06529365

END