NHST: The frequentist treats the hypothesis as fixed — it’s either true or false. The sampling distribution tells us the likelihood of obtaining data like ours (or more extreme) if the hypothesis were true. We don’t attach a probability to the truth of the hypothesis itself.
Statistical processes produce samples from an underlying probability mass function or probability density function, and frequentists model real-world phenomena as if they arise from such processes. In practice, we don’t observe these full distributions directly — instead, we infer them from randomly drawn samples. This is the foundation of statistical inference - a practice which does not fully align with the scientific method. One core method in this framework is null hypothesis significance testing (NHST), which provides a structured way to test whether the distribution we suspect is driving the data could be wrong. It allows us to make probabilistic statements about how surprising our observed data would be if the null hypothesis were true — thereby quantifying the evidence against that hypothesis.
NHST is often criticised for primarily detecting large, obvious effects and for its limitations in assessing nuanced hypotheses. However, when thoughtfully designed with adequate power, precise hypotheses, and appropriate controls, NHST can be a robust tool for rigorous scientific inference, shifting its role from merely flagging extreme failures to genuinely testing theory-driven predictions.
NHST provides a quantitative framework/justification for determining whether a hypothesis can still stand in the face of data (e.g. is a coin considered fair given a skewed distribution of heads and tails). It is a frequentist technique based solely on the likelihood, with the choice of null/alternative hypothesis being subjective (much like the priors of Bayesian statistics). The Neyman-Pearson approach helps determine when the null hypothesis should be rejected based on the significance of the test. This means deciding how much risk we’re willing to accept of incorrectly rejecting the null hypothesis when the observed data is merely unlikely under it.
Comparison of a High Power and a Low Power Test
Note
All computations in frequentist statistics involve the likelihood function \(P(x|\theta)\), which plays a central role in parameter estimation and hypothesis testing. In frequentist inference, we do not assign probabilities to hypotheses. Specifically, \(P(H_0)\) and \(P(H_0|x)\) are undefined; instead, inference is made by assuming \(H_0\) is true and evaluating how likely the observed data \(x\) is under this assumption: \[\boxed{\text{If } P(x|H_0)\text{ is small} \Rightarrow \text{Reject } H_0}\quad\text{(frequentist logic)}\] The likelihood function \(\mathcal{L}(\theta|x) = P(x|\theta)\) is not a probability distribution over parameters, but a function describing how plausible different values of \(\theta\) are given fixed observed data \(x\). Specifically:
Low likelihood under \(H_0\) - likely small p-value - high chance of rejection
High likelihood under \(H_0\) - likely large p-value - low chance of rejection
Contrast this with the Bayesian expression in which probabilities are assigned to the prior and posterior probabilities: \[\boxed{P(H_0|x) = \frac{P(x|H_0)P(H_0)}{P(x)}}\quad\text{(Bayes' theorem — not used in frequentist inference)}\]
| term | notation | description |
|---|---|---|
| test statistic | \(T\) | random variable with probability distribution given by the null distribution |
| null distribution | \(P(T\mid H_0)\) | the PDF/PMF of the test statistic \(T\) under \(H_0\) |
| p-value | \(p=1-F(t_{obs}\mid H_0)\) | the cumulative “tail” probability of the test statistic distribution given \(H_0\): the probability of obtaining a test statistic as extreme as or more extreme than the observed value |
| critical value/threshold | \(\{z_\alpha, t_\alpha,\dots\}\) | value of the test statistic that defines the rejection threshold for \(H_0\) |
A simple hypothesis completely specifies the probability distribution of the population using fixed/hypothesised parameter values e.g. \(H:\:X\sim\text{Norm}(0,1)\). For the example coin experiment the null hypothesis \(H_0:\:X\sim \text{Bin}(10,0.5)\) is a simple hypothesis.
A composite hypothesis specifies the probability distribution of the population using unknown/flexible parameter values, e.g. \(H:\:X\sim\text{Norm}(\mu,\sigma)\); since the parameters \(\mu\) and \(\sigma\) are free, this is really a family of hypotheses rather than a single one. For the example coin experiment the alternative hypothesis \(H_1:\:X\sim \text{Bin}(10,p)\quad\text{where}\:p\ne 0.5\) is a composite hypothesis since \(p\) has a range of possible values.
Power: The Actual Test Statistic Distribution
Somewhat paradoxically, we need the actual test statistic distribution to compute the power of an experiment. This is solved by estimating the distribution of the test statistic under the alternative hypothesis and computing the power against that estimate.
| estimation method | use case |
|---|---|
| Effect size estimates | Prior research or pilot studies available |
| Normal/t-distribution approximation | Large samples or known distributions |
| Bootstrapping | Small samples, unknown distributions |
| Monte Carlo simulation | Complex models, no closed formula |
| Nonparametric methods | No assumptions about distribution |
Typically we can increase the power of a test by increasing the amount of data and thereby decreasing the variance of the null and alternative distributions. In experimental design it is important to determine ahead of time the number of trials or subjects needed to achieve a desired power.
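As a concrete illustration, base R’s `power.t.test` does this kind of pre-experiment sizing; the effect size, standard deviation and target power below are made-up values for the sketch, not from any particular study:

```r
# Required n per group to detect a mean difference of 0.5 (sd = 1)
# with 80% power at the 5% significance level -- illustrative values only
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
```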
Significance (\(\alpha\)-value)
For a composite \(H_0\), the member distribution that yields the highest significance (the worst case) is the distribution we use to set the significance level.
Significance analogy in Legal Trials. In criminal law, the standard of proof is “beyond a reasonable doubt,” which is designed to minimise Type I errors (wrongfully convicting an innocent person). This is equivalent to setting a very low significance level (\(\alpha\)), requiring very strong evidence before rejecting the presumption of innocence \(H_0\). However, this does not mean that Type I errors (wrongful convictions) are impossible — only that the legal system is structured to minimise them. The trade-off is that reducing Type I errors increases the probability of Type II errors (failing to convict a guilty person), just as lowering \(\alpha\) in statistics increases the chance of failing to reject a false null hypothesis. The bottom line: the legal system implicitly acknowledges that some wrongful convictions will occur, which is why appeals exist to correct them. Whenever we prosecute, there is always a chance of error.
Controlling Power and Significance
%%{init: {'theme':'neutral', 'fontFamily':'Arial', 'flowchart': {'useMaxWidth':false}}}%%
flowchart LR
A["α Increase"] --> B["Power (1-β)"]
A --> C["Type I Error Increase"]
D["Sample Size Increase"] --> B
E["Effect Size Increase"] --> B
F["Variability Reduction (data std.dev.)"] --> B
linkStyle 0,1 stroke:#ff6b6b,stroke-width:2px
linkStyle 2,3,4 stroke:#4dabf7,stroke-width:2px
class A,C critical;
class B success;
classDef critical fill:#ff8787,stroke:#fa5252
classDef success fill:#69db7c,stroke:#2b8a3e
Tip
Diagram Legend
Red Nodes: Significance threshold (\(\alpha\)) and Type I Error
Green Node: Statistical power (1-\(\beta\))
Red Arrows: Trade-off (higher \(\alpha\) - more power but more false positives)
Blue Arrows: Power-boosting factors
\(p\)-value - the heart of NHST tests
The \(p\)-value is calculated by looking up the test statistic (or even just the raw random variable value) against whatever CDF models the null distribution. The \(p\)-value is the probability, assuming the null hypothesis \(H_0\), of observing a result at least as extreme as the one we got. If the \(p\)-value is less than the significance level \(\alpha\) then we reject \(H_0\). Otherwise we fail to reject \(H_0\).
| Test Type | Alternative Hypothesis | Area Considered | Summary |
|---|---|---|---|
| Two-tailed | \(H_1:\mu\ne\mu_0\) | Both tails; p-value is the smaller tail probability (left or right) \(\times2\) | Surprise in either direction |
| One-tailed (left) | \(H_1:\mu<\mu_0\) | Left tail only; p-value is p<dist>(...) i.e. p<dist>(lower.tail = TRUE, ...) | Surprise only if result is too small |
| One-tailed (right) | \(H_1:\mu>\mu_0\) | Right tail only; p-value is 1-p<dist>(...) or p<dist>(lower.tail = FALSE, ...) | Surprise only if result is too large |
one-tailed (right) example
IQ scores are designed to follow a normal distribution \(IQ\sim N(100,15^2)\) with mean \(\mu=100\) and standard deviation \(\sigma=15\). This guy claims to have an IQ of \(173\). How extreme is an IQ of \(173\) under \(N(100,15^2)\)?
\[z=\frac{173-100}{15}=\frac{73}{15}\approx4.87\]
# Probability of getting an IQ ≥ 173 in a standard population
1 - pnorm(173, mean = 100, sd = 15)
[1] 5.674811e-07
pnorm(173, mean =100, sd =15, lower.tail =FALSE)
[1] 5.674811e-07
pnorm(73/15, mean =0, sd =1, lower.tail =FALSE)
[1] 5.674811e-07
Probability of randomly finding someone with an IQ that high: about \(0.00006\%\) \((5.7\times10^{-7})\). I reject the null hypothesis that this is a true claim.
Common Tests/Test Statistic…
Comparing Means
Student’s t-test - compare means of two unknown sampled normal distributions
The t-test tests the similarity between two groups of samples (each group being drawn from a normally distributed population), using the t-statistic, a random variable which follows the Student’s t-distribution. The statistic is low for closely matching groups and high for distinct groups: it grows with the difference between the group means (numerator) and shrinks with the variability within the groups (denominator).
The t-test calculates the p-value using a t-statistic against the Student’s t-distribution. This is equivalent to calculating the p-value from the appropriate normal distribution; however, here we don’t actually know \(\mu\) and \(\sigma\) for that distribution, so we fall back to the t-distribution.
\[t=\frac{\text{signal}}{\text{noise}}=\frac{\text{difference between group means}}{\text{standard error of the difference}}=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\]
Uses the t-distribution \(t_\nu\) with degrees of freedom \(\nu\) based on the sample size to determine the probability of observing a scenario at least as extreme as the one observed.
used when we’re comparing two sample means
the populations are normally distributed (typically a sum of random variables in accordance with the CLT)
the population variance is unknown (so you estimate it from the sample)
There are a few variations of the t-test:
Consider the mtcars dataset. Construct a 95% t interval for MPG comparing 4- to 6-cylinder cars (subtracting in the order 4 − 6), assuming constant variance.
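A sketch of one way to do this in R, using `t.test` with `var.equal = TRUE` as the exercise asks (the vector names are mine):

```r
# 95% pooled-variance t interval for mpg: 4-cylinder minus 6-cylinder cars
data(mtcars)
mpg4 <- mtcars$mpg[mtcars$cyl == 4]
mpg6 <- mtcars$mpg[mtcars$cyl == 6]
t.test(mpg4, mpg6, var.equal = TRUE)$conf.int
```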
One-sample t-test
Test whether the estimated mean of normally distributed random sample data differs from a known or hypothesised population mean.
test statistic: \[t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}\quad\text{where}\:\begin{cases}
s\:\text{is the sample standard deviation} \\
\mu_0\:\text{is the hypothesised population mean}\\
\bar{x}\:\text{is the sample mean}\\
n\:\text{is the sample size}\\
\text{the t-statistic is compared against t-distribution with }n−1\text{ dof}
\end{cases}\]
Here we are getting the p-value for the hypothesised population mean, whose cumulative probability distribution corresponds to the sampling distribution of the mean estimator for unknown \(\sigma\), i.e. how extreme is the observed data under the null hypothesis?
Is the average miles per gallon of cars equal to \(21\) mpg?
Here I’ll use the mtcars dataset assuming the cars were randomly sampled from the population (it isn’t btw, it’s actually a convenience sample of specific car models from the early ’70s)
data(mtcars)
# Random sample of mileage measurements
mpg_values <- mtcars$mpg
# Calculation breakdown (using Student's t-distribution)
n <- length(mpg_values)      # no. samples
df <- n - 1                  # degrees of freedom
x_bar <- mean(mpg_values)    # sample mean
s <- sd(mpg_values)          # sample std.dev
mu_0 <- 21                   # hypothesised population mean
t_stat <- (x_bar - mu_0)/(s/sqrt(n))     # t-statistic
p_value <- 2*pt(-abs(t_stat), df = df)   # two-tailed p-value
print(t_stat)
[1] -0.8535335
print(p_value) # ~40% chance of seeing these values under the 21 mpg hypothesis
[1] 0.3999109
The probability of observing a test statistic at least as extreme as \(t=-0.85\), assuming the null hypothesis is true, is \(\approx 40\%\). This is far more than \(5\%\), so the data is not at all extreme under \(H_0\) and we fail to reject the null hypothesis.
Here’s the quick practical way to do it…
t.test(mpg_values, mu =21)
One Sample t-test
data: mpg_values
t = -0.85353, df = 31, p-value = 0.3999
alternative hypothesis: true mean is not equal to 21
95 percent confidence interval:
17.91768 22.26357
sample estimates:
mean of x
20.09062
I hate the way R outputs the alternative hypothesis to make it look like we reject the null hypothesis - we don’t!
Two-sample pooled t-test (matching variance)
Compare the means of two independent normally distributed populations from randomly sampled data with equal variance.
test statistic: \[t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_p^2}{n_1}+\frac{s_p^2}{n_2}}}\quad\text{where}\:\begin{cases}
s_p\:\text{is the pooled sample standard deviation} \\
\bar{x}_1,\bar{x}_2\:\text{are the respective sample means}\\
n_1,n_2\:\text{are the respective sample sizes}\\
\text{the t-statistic is compared against t-distribution with }n_1+n_2−2\text{ dof}
\end{cases}\]
pooled standard deviation: \[s_p=\sqrt{\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2}}\]
null distribution: Student’s t-distribution
This test compares the means of two independent sets of samples, assuming the two populations are normally distributed with the same variance and different mean i.e. \(X_1\sim N(\mu_1,\sigma)\) and \(X_2\sim N(\mu_2,\sigma)\). The test is called “pooled” because it combines (or pools) the sample variances to estimate a common variance.
Use the pooled t-test only if the normality assumption is met (or \(n\) is large, by the CLT) and the variances are equal (check with an F-test or Levene’s test). Otherwise, use Welch’s t-test, which doesn’t assume equal variances.
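In R the choice is just a flag on `t.test` (a sketch; `x` and `y` stand in for the two samples):

```r
t.test(x, y, var.equal = TRUE)   # pooled t-test (assumes equal variances)
t.test(x, y)                     # Welch's t-test (the default, var.equal = FALSE)
var.test(x, y)                   # F-test for equality of variances (normality-sensitive)
```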
pooled t-test statistic:
The pooled t-test statistic is given by: \[\frac{\bar{X}_1 - \bar{X}_2}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim \mathcal{t}_{n_1+n_2-2}\qquad\text{where}\quad\sigma^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\]
\[
\begin{align}
&\text{data drawn from two }\textbf{populations}\text{ with differing mean and matching variance}:\\
&\qquad X_1\sim \mathcal{N}\left(\mu_1,\sigma^2\right)
\quad\text{and}\quad X_2\sim \mathcal{N}\left(\mu_2,\sigma^2\right)\\
\\
&\text{...but the variances of the sample means won't match, even under }H_0\text{, if the sample sizes differ...}\\
&\qquad\bar{X}_1\sim \mathcal{N}\left(\mu_1,\sigma^2/n_1\right)\quad\text{and}\quad\bar{X}_2\sim \mathcal{N}\left(\mu_2,\sigma^2/n_2\right)\quad\text{(variance of the sample mean)}\\
&\qquad\frac{\bar{X}_1-\hat{\mu}_1}{\hat{\sigma}/\sqrt{n_1}}\sim \mathcal{t}_{n_1-1}
\quad\text{and}\quad\frac{\bar{X}_2-\hat{\mu}_2}{\hat{\sigma}/\sqrt{n_2}}\sim \mathcal{t}_{n_2-1}\qquad\text{(standardised)}\\
\end{align}
\]
\[
\begin{align}
&\text{sampling distribution of the mean difference estimator has a similar form,}\\
&\qquad\mathbb{E}[\bar{X}_1-\bar{X}_2]=\mathbb{E}[\bar{X}_1]-\mathbb{E}[\bar{X}_2]=\mu'\qquad\text{(linearity of expectation)}\\
&\qquad\operatorname{Var}[\bar{X}_1-\bar{X}_2]=\operatorname{Var}[\bar{X}_1]+\operatorname{Var}[\bar{X}_2]=\frac{\sigma^2}{n_1}+\frac{\sigma^2}{n_2}={\sigma'}^2\qquad\text{(variance of an independent sum)}\\
&\quad\therefore\quad\bar{X}_1-\bar{X}_2\sim\mathcal{N}\left(\mu',{\sigma'}^2\right)\qquad\text{(exact, if }\sigma^2\text{ were known)}\\
&\quad\therefore\quad\frac{(\bar{X}_1-\bar{X}_2)-\mu'}{\hat{\sigma}'}\sim\mathcal{t}_{n_1+n_2-2}\qquad\text{(sampling distribution once }\sigma'\text{ is replaced by its pooled estimate; two dof are lost for the two estimated means)}
\end{align}
\]
\[
\begin{align}
&\:\text{common variance estimated by the }\textbf{pooled}\text{ sample variance,}\\
&\;\text{this is just the weighted average of two sample variances...}\\\\
&\qquad\frac{\left(\bar{X}_1 - \bar{X}_2\right)-\hat{\mu}'}{\hat{\sigma}'\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim \mathcal{t}_{n_1+n_2-2}\quad\text{where}\quad\hat{\sigma}'^2=\frac{(n_1-1)\hat{\sigma}_1^2+(n_2-1)\hat{\sigma}_2^2}{n_1+n_2-2}\\\\
&\text{under }H_0:\hat{\mu}_1=\hat{\mu}_2\text{, and linearity of expectation: }\hat{\mu}'=0:\\\\
&\qquad\frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma}'\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim \mathcal{t}_{n_1+n_2-2}\\\\
&\text{we can express the denominator more simply as...}\\\\
&\qquad\frac{\bar{X}_1 - \bar{X}_2}{\operatorname{SE}} \sim \mathcal{t}_{n_1+n_2-2}\qquad\text{where}\quad\operatorname{SE} = \sqrt{ \hat{\sigma}'^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) }
\end{align}
\]
toy example: does automatic transmission affect fuel economy?
with(mtcars, t.test(mpg[am ==0], mpg[am ==1]))
Welch Two Sample t-test
data: mpg[am == 0] and mpg[am == 1]
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean of x mean of y
17.14737 24.39231
# ...or...
t.test(mpg ~ am, data = mtcars)
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
older automatics generally used more fuel than manual transmissions due to factors like the torque converter and fewer available gears. Modern automatics often have more gears and improved technology, leading to fuel economy figures comparable to or even better than manuals.
Two-sample Welch’s t-test (different variance)
Compare the means of two independent normally distributed populations from randomly sampled data with different variances (more common, and safer than the pooled t-test, in practice).
test statistic: \[t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\quad\text{where}\:\begin{cases}
s_1,s_2\:\text{are the respective sample standard deviations} \\
\bar{x}_1,\bar{x}_2\:\text{are the respective sample means}\\
n_1,n_2\:\text{are the respective sample sizes}\\
\text{the t-statistic is compared against t-distribution with (typically) fractional dof (see below...)}
\end{cases}\]
Paired t-test Used for mean comparison on matched samples (e.g., before/after measurements or repeated measures on the same subjects). Tests whether the mean of the differences between paired observations is zero.
test statistic: \[t = \frac{\bar{d}}{s_d / \sqrt{n}}\quad\text{where}\:\begin{cases}
d_i=x_i-y_i\:\text{is the paired difference} \\
s_d\:\text{is the standard deviation of the differences} \\
\bar{d}\:\text{is the mean of the differences}\\
n\:\text{number of pairs}\\
\text{the t-statistic is compared against the }n-1\text{ t-distribution}
\end{cases}\]
standard deviation of the differences: \[s_d = \sqrt{ \frac{1}{n - 1} \sum_{i=1}^{n} (d_i - \bar{d})^2 }\]
This is simply reducing the paired problem to a one-sample t-test on the differences, testing: \[H_0: \mu_d = 0 \quad \text{vs.} \quad H_1: \mu_d \neq 0\]
Do two soporific drugs differ in their effect (increase in hours of sleep compared to control) on the same 10 patients? Three populations: control, drug 1 and drug 2…
## Paired t-test
## The sleep data is actually paired, so could have been in wide format:
sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group")
## Traditional interface
t.test(sleep2$extra.1, sleep2$extra.2, paired = TRUE)
Paired t-test
data: sleep2$extra.1 and sleep2$extra.2
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-2.4598858 -0.7001142
sample estimates:
mean difference
-1.58
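Since the paired test is just a one-sample t-test on the differences, the following check (reusing the `sleep2` object created above) reproduces the same t, df and p-value:

```r
# one-sample t-test on the paired differences -- identical to the paired test above
t.test(sleep2$extra.1 - sleep2$extra.2, mu = 0)
```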
z-test - compare means of two known sampled normal distributions
Uses the standard normal distribution \(N(0,1)\) to determine the probability of observing a scenario at least as extreme as the one observed.
used when we’re comparing means (sample vs. population or two samples)
the populations are normally distributed (typically a sum of random variables in accordance with the CLT)
the population variance is known (very rarely the case, although if the sample size is large enough, \(n\ge30\), we can approximate the t-test with a z-test)
One-sample z-test Test whether the mean of a sample from a single normally distributed population differs from a known or hypothesised population mean.
test statistic: \[z=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\quad\text{where}\:\begin{cases}
\sigma\:\text{is the population standard deviation} \\
\mu\:\text{is the population mean}\\
\bar{x}\:\text{is the sample mean}\\
n\:\text{is the sample size}\\
\text{the z-statistic is compared against the standard normal distribution }N(0,1)
\end{cases}\]
A factory produces light bulbs, and the manufacturer claims that the average lifespan of a light bulb is 1,000 hours. A sample of 50 light bulbs from a new batch has a sample mean lifespan of 1,020 hours. The population standard deviation is known to be 100 hours. Does this sample show a significant difference from the claimed mean?
Null hypothesis \(H_0: \mu=1000\) (the population mean lifespan is 1000 hours).
Alternative hypothesis \(H_A: \mu\ne1000\) (the population mean lifespan is not 1000 hours).
\[z=\frac{1020-1000}{100/\sqrt{50}}\approx1.41\]
z <- (1020 - 1000)/(100/sqrt(50))
pnorm(z, 0, 1, lower.tail = TRUE)  # left 0.92 so this value lies to the right of the distribution
[1] 0.9213504
pnorm(z,0,1,lower.tail =FALSE) # right 0.08 so this is quite extreme
[1] 0.0786496
p <- 2*pnorm(z, 0, 1, lower.tail = FALSE)  # two-tailed p-value ~ 0.16, exceeds 0.05
print(p)
[1] 0.1572992
The two-tailed p-value \(p \approx 0.16\) exceeds the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis; there is no significant evidence to dispute the manufacturer’s claim.
Two-Sample Z-test (Independent Z-test) Test whether the means of two independent samples differ, when the populations are normally distributed and their variances are known.
test statistic: \[z=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}\quad\text{where}\:\begin{cases}
\sigma_1,\sigma_2\:\text{are the known population standard deviations} \\
\bar{x}_1,\bar{x}_2\:\text{are the respective sample means}\\
n_1,n_2\:\text{are the respective sample sizes}\\
\text{the z-statistic is compared against the standard normal distribution}
\end{cases}\]
null distribution: Standard Normal Distribution \(z\sim N(0,1)\)
You want to compare the average test scores of students from two schools. School A has 40 students with a mean score of 85, and School B has 50 students with a mean score of 80. The population standard deviations for both schools are known?!: School A (\(\sigma_A=10\)) and School B (\(\sigma_B=12\)).
Null hypothesis \(H_0:\mu_A=\mu_B\) (the average scores from both schools are equal)
Alternative hypothesis \(H_A:\mu_A\ne\mu_B\) (the average scores from both schools are different)
z <- (85 - 80)/sqrt((10^2/40) + (12^2/50))
pnorm(z, 0, 1, lower.tail = TRUE)  # left 0.98 so the value lies to the right of the distribution
[1] 0.9844446
pnorm(z,0,1,lower.tail =FALSE) # right 0.015 so this is very extreme
[1] 0.01555538
p <- 2*pnorm(z, 0, 1, lower.tail = FALSE)
print(p)
[1] 0.03111077
The p-value \(p \approx 0.03\) is below the significance level \(\alpha = 0.05\), so we reject the null hypothesis; there is significant evidence that the schools are performing differently.
Z-test for Proportions (One-Sample Proportion Z-test) Compares a sample proportion to a known population proportion.
test statistic: \[z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}\quad\text{where}\:\begin{cases}
\hat{p}\:\text{is the sample proportion} \\
p_0\:\text{is the population proportion}\\
n\:\text{is the sample size}
\end{cases}\]null distribution: Standard Normal Distribution \(z\sim N(0,1)\)
Assumes the sample size is large enough such that both \(np\) and \(n(1−p)\) are greater than 5 (where \(p\) is the population proportion).
In a survey of 400 voters, 250 say they support a particular candidate. You want to know if the proportion of voters supporting the candidate is significantly different from 0.60 (i.e., 60%).
Null hypothesis \(H_0:p=0.6\) (the population proportion of supporters is 60%)
Alternative hypothesis \(H_A:p\ne0.6\) (the population proportion of supporters is not 60%)
z <- ((250/400) - 0.6)/sqrt(0.6*(1 - 0.6)/400)
pnorm(z, 0, 1, lower.tail = TRUE)  # left 0.85 so the value lies to the right of the distribution
[1] 0.8462829
pnorm(z,0,1,lower.tail =FALSE) # right 0.15 so this is ok
[1] 0.1537171
p <- 2*pnorm(z, 0, 1, lower.tail = FALSE)
print(p)
[1] 0.3074342
The p-value \(p = 0.31\) far exceeds the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis; there is no significant evidence that the proportion of voters supporting the candidate differs from the expected 60%.
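The built-in `prop.test` (without continuity correction) runs the equivalent test; the X-squared it reports is the square of the z computed above:

```r
# equivalent built-in test: X-squared here equals z^2 from the manual calculation
prop.test(x = 250, n = 400, p = 0.6, correct = FALSE)
```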
Two-Sample Z-test for Proportions Compares the proportions from two independent samples.
test statistic: \[z=\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}\quad\text{where}\:\begin{cases}
\hat{p}_1,\hat{p}_2\:\text{are the sample proportions} \\
\hat{p}\:\text{is the combined proportion of successes in both samples} \\
n_1,n_2\:\text{are the sample sizes}
\end{cases}\]null distribution: Standard Normal Distribution \(z\sim N(0,1)\)
Assumes both sample sizes meet the normal approximation, i.e. each \(n_i\) is large enough that both \(n_i p\) and \(n_i(1−p)\) are greater than 5 (where \(p\) is the population proportion).
You want to compare the proportions of male and female voters supporting a particular candidate. In a sample of 200 males, 130 support the candidate, and in a sample of 300 females, 180 support the candidate.
Null hypothesis \(H_0:p_1=p_2\) (the proportion of male and female voters supporting the candidate is the same)
Alternative hypothesis \(H_A:p_1\ne p_2\) (the proportions are different)
p_pool <- (130 + 180)/(200 + 300)  # pooled proportion of successes
z <- ((130/200) - (180/300))/sqrt(p_pool*(1 - p_pool)*((1/200) + (1/300)))
pnorm(z, 0, 1, lower.tail = TRUE)  # left 0.87 so the value lies to the right of the distribution
[1] 0.8704299
pnorm(z,0,1,lower.tail =FALSE) # right 0.12 so this is ok
[1] 0.1295701
p <- 2*pnorm(z, 0, 1, lower.tail = FALSE)
print(p)
[1] 0.2591402
The p-value \(p = 0.26\) exceeds the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis; there is no significant evidence that men and women differ in their support for the candidate.
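Again, `prop.test` without continuity correction gives the equivalent result (its X-squared equals the square of the z above):

```r
# two-sample proportion test: supporters out of the sampled males and females
prop.test(x = c(130, 180), n = c(200, 300), correct = FALSE)
```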
F-Test TODO
Mann-Whitney U Test TODO - Compare medians
Mann-Whitney U Test (Wilcoxon Rank-Sum) Compare medians of two independent samples without assuming normality.
test statistic: \[U=\min(U_1,U_2)\quad\text{where}\:\begin{cases}
R_1=\text{sum of ranks for sample }A\\
U_1=R_1-\frac{n_1(n_1+1)}{2}\\
U_2=n_1n_2-U_1
\end{cases}\]
null distribution: the Mann-Whitney U distribution, which is approximately normal for large sample sizes.
Used to test whether two independent groups come from the same distribution when the assumptions of the independent two-sample t-test (normality, equal variances) are not met.
Null hypothesis \(H_0\): The distributions of both groups are equal (same median)
Alternative hypothesis \(H_A\): The distributions differ (medians are not equal)
Are the scores of two groups significantly different?
Group A: 85, 90, 88, 75
Group B: 70, 65, 80, 60
groupA <- c(85, 90, 88, 75)
groupB <- c(70, 65, 80, 60)
wilcox.test(groupA, groupB, alternative = "two.sided", exact = FALSE)  # exact = FALSE for larger samples or ties
Wilcoxon rank sum test with continuity correction
data: groupA and groupB
W = 15, p-value = 0.0606
alternative hypothesis: true location shift is not equal to 0
Categorical Data
Chi-squared test - compare sample to a distribution or categorical data from different contexts
Uses the chi-squared distribution \(\chi^2_\nu\) with degrees of freedom \(\nu\) based on sample size to determine the probability of observing a scenario at least as extreme as the one observed.
used when we’re testing categorical data (like frequencies or variances) ~ really a sum of squared standard normal variables (e.g. testing if observed frequencies differ from expected)
testing if variance of a normal distribution differs from a known value
ALWAYS right skewed since chi squared test only measures total deviation, not direction!!
Assumes errors are normally distributed when used for variance testing
Chi-squared Test of Independence test whether two categorical variables are independent of each other.
test statistic: \[\chi^2=\sum\frac{(O-E)^2}{E}\quad\text{where}\:\begin{cases}
O\:\text{observed value} \\
E\:\text{expected value}\\
\quad E=\frac{\text{row total}\times\text{column total}}{\text{grand total}}\\
\chi^2\:\text{statistic is used to calculate the p-value from the} \\
\quad \text{cumulative upper-tail chi-squared distribution where...}\\
\quad \text{degrees of freedom }df=(\text{No. Rows}−1)\cdot(\text{No. Columns}−1)
\end{cases}\]
null distribution: chi-squared distribution \(\chi^2_\nu\) (pchisq)
Is there an association between education level and job sector?
| observed: education \ sector | Tech | Non-Tech | Total |
|---|---|---|---|
| Bachelor’s | 30 | 20 | 50 |
| Master’s | 40 | 10 | 50 |
| Total | 70 | 30 | 100 |
Null hypothesis \(H_0:\) Education level and job sector are independent
Alternative hypothesis \(H_A:\) They are dependent
observed <- matrix(c(30, 20, 40, 10), nrow = 2, byrow = TRUE)
expected <- matrix(c(0.7*50, 0.3*50, 0.7*50, 0.3*50), nrow = 2, byrow = TRUE)
chisq.stat <- sum((observed - expected)^2/expected)
# ALWAYS right skewed since chi squared test only measures total deviation, not direction!!
p <- pchisq(chisq.stat, df = 1, lower.tail = FALSE)  # right 0.03 so this is low
print(p)
[1] 0.02909633
Alternatively…
observed <- matrix(c(30, 20, 40, 10), nrow = 2, byrow = TRUE)
dimnames(observed) <- list(Education = c("Bachelors", "Masters"),
                           Sector = c("Tech", "Non-Tech"))
chisq.test(observed, correct = FALSE)  # i should run with Yates' correction i guess
The p-value \(p \approx 0.03\) is below the significance level \(\alpha = 0.05\), so we reject the null hypothesis; there is significant evidence of a dependence between the two variables.
Chi-squared Goodness-of-Fit Test test if an observed distribution matches an expected (theoretical) distribution.
test statistic: \[\chi^2=\sum\frac{(O_i-E_i)^2}{E_i}\quad\text{where}\:\begin{cases}
O_i\:\text{observed value for category }i \\
E_i\:\text{expected value for category }i \text{ following the expected distribution} \\
\chi^2\:\text{statistic is used to calculate the p-value from the} \\
\quad \text{cumulative upper-tail chi-squared distribution where...} \\
\quad \text{degrees of freedom }df=k-1\text{ (}k\text{ categories)}
\end{cases}\]
null distribution: chi-squared distribution \(\chi^2_\nu\) (pchisq)
Are Rowentree’s Fruitpastels packed in equal colours?
Observed: Red = 15, Blue = 25, Green = 20
Null hypothesis \(H_0:\) They follow a uniform distribution
Alternative hypothesis \(H_A:\) They don’t follow a uniform distribution
observed <- c(15, 25, 20)
expected <- rep(20, 3)
chisq.stat <- sum((observed - expected)^2/expected)
# ALWAYS right skewed since chi squared test only measures total deviation, not direction!!
p <- pchisq(chisq.stat, df = 2, lower.tail = FALSE)  # right 0.29
print(p)
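The output shown below appears to come from the equivalent built-in call, which uses equal expected probabilities by default (a reconstruction; the original chunk isn’t reproduced here):

```r
chisq.test(observed)  # goodness-of-fit against equal probabilities by default
```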
Chi-squared test for given probabilities
data: observed
X-squared = 2.5, df = 2, p-value = 0.2865
The p-value \(p = 0.29\) is well above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis - there is no significant evidence that the Fruitpastels colours deviate from a uniform distribution.
Super Mario Piranha Plant Example
Does the Super Mario Piranha Plant use the hypothesised gear mechanism?
Null hypothesis \(H_0:\) They follow a uniform distribution with \(U(1,13)\)
Alternative hypothesis \(H_A:\) They don’t follow a uniform distribution
barplot(table(observed), main = "Frequency of Values", xlab = "Values",
        ylab = "Frequency", col = "lightblue", border = "darkblue")
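The `observed` vector used by the barplot above isn’t reproduced in this note; as a sketch with simulated stand-in data, the test itself is a goodness-of-fit against a discrete uniform on 1–13:

```r
# hypothetical stand-in data: replace with the actual recorded values
set.seed(1)
observed <- sample(1:13, size = 200, replace = TRUE)
# goodness-of-fit against a discrete uniform distribution on 1..13
chisq.test(table(factor(observed, levels = 1:13)))
```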
Chi-squared Test for Homogeneity (test of independence across two or more populations) Test if different populations have the same distribution of a categorical variable.
test statistic: \[\chi^2=\sum\frac{(O-E)^2}{E}\quad\text{where}\:\begin{cases}
O\:\text{observed value} \\
E\:\text{expected value}\\
\quad E=\frac{\text{row total}\times\text{column total}}{\text{grand total}}\\
\chi^2\:\text{statistic is used to calculate the p-value from the} \\
\quad \text{cumulative upper-tail chi-squared distribution where...}\\
\quad \text{degrees of freedom }df=(\text{No. Rows}−1)\cdot(\text{No. Columns}−1)
\end{cases}\]
null distribution: chi-squared distribution \(\chi^2_\nu\) (pchisq)
Are product preferences the same in 3 cities?
| Observed: Product \ City | Inverness | Dundee | Aberdeen |
|---|---|---|---|
| Apple | 30 | 40 | 35 |
| Samsung | 20 | 25 | 30 |
Null hypothesis \(H_0:\) Preferences are the same across all cities
Alternative hypothesis \(H_A:\) Preferences differ by city
observed <- matrix(c(30, 40, 35, 20, 25, 30), nrow = 2, byrow = TRUE)
expected <- matrix(c(105*50/180, 105*65/180, 105*65/180,
                     75*50/180,  75*65/180,  75*65/180), nrow = 2, byrow = TRUE)
chisq.stat <- sum((observed - expected)^2/expected)
# ALWAYS right skewed since chi squared test only measures total deviation, not direction!!
p <- pchisq(chisq.stat, df = 2, lower.tail = FALSE)  # right 0.65 so this is high
print(p)
[1] 0.647158
Alternatively…
observed <- matrix(c(30, 40, 35, 20, 25, 30), nrow = 2, byrow = TRUE)
dimnames(observed) <- list(Product = c("Apple", "Samsung"),
                           City = c("Inverness", "Dundee", "Aberdeen"))
chisq.test(observed, correct = FALSE)  # i should run with Yates' correction i guess
The p-value \(p = 0.65\) is high and far above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis - there is no significant evidence that product preference varies between the cities.
Comparing Variances
F-test - Compare two variances
often used before t-tests but it really does fuck all - only catching variances that are seriously out - use a Welch t-test and grow up.
test statistic: \[F=\frac{s_1^2}{s_2^2}\quad\text{where}\:\begin{cases}
s_1,s_2\:\text{sample std.dev} \\
df_1=n_1-1:\text{degrees of freedom for sample size }n_1\\
df_2=n_2-1:\text{degrees of freedom for sample size }n_2\\
\end{cases}\]
null distribution: F-distribution \(F_{df_1,df_2}\) (pf)
Two machines produce metal rods. We want to test whether the variability in lengths (i.e., variances) differs between the two machines.
var.test(x =rnorm(15, sd =4.2), y =rnorm(12, sd =3.1))
F test to compare two variances
data: rnorm(15, sd = 4.2) and rnorm(12, sd = 3.1)
F = 2.6988, num df = 14, denom df = 11, p-value = 0.1048
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.803501 8.351701
sample estimates:
ratio of variances
2.698807
The p-value \(p = 0.10\) is above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis - there is no significant evidence that the population variances differ.
TODO: - Levene’s test Compare variances without assuming normality (more robust than F) - Bartlett’s test Like Levene, but assumes normality — for 2+ groups
Correlation & Regression
Pearson correlation test - test linear relationship between two continuous variables
dig out the C++ implementation I used to validate dysis camera calibration data in the sys_info validation step
test statistic: \[t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\quad\text{where}\:\begin{cases}
r\:\text{Pearson correlation coefficient}\\
n\:\text{is the sample size}\\
\text{the t-statistic is compared against t-distribution with }n−2\text{ dof}
\end{cases}\]
Null hypothesis \(H_0:\rho=0\) (no linear correlation)
Alternative hypothesis \(H_A:\rho\ne0\) (a non-zero linear correlation)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
cor.test(x, y, method = "pearson")
Pearson's product-moment correlation
data: x and y
t = 2.1213, df = 3, p-value = 0.124
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3400820 0.9842358
sample estimates:
cor
0.7745967
Spearman correlation test - test monotonic non-linear relationship for ranked data
Rather than working with raw values, it converts the data to ranks and then computes the Pearson correlation of the ranks.
Spearman correlation coefficient (for data with no ties; the general formula with ties is more complex): \[\rho_s=1-\frac{6\sum d_i^2}{n(n^2-1)}\quad\text{where}\:\begin{cases}
d_i\:\text{is the difference between the ranks of each pair} \\
n\:\text{number of pairs}
\end{cases}\]
test statistic: \[t=\frac{\rho_s\sqrt{n-2}}{\sqrt{1-\rho_s^2}}\quad\text{where}\:\begin{cases}
\rho_s\:\text{is the Spearman rank correlation coefficient}\\
n\:\text{is the sample size}\\
\text{the t-statistic is compared against t-distribution with }n−2\text{ dof}
\end{cases}\]
Null hypothesis \(H_0:\rho_s=0\) (no monotonic relationship)
Alternative hypothesis \(H_A:\rho_s\ne0\) (there is a monotonic relationship)
x <- c(10, 20, 30, 40, 50)
y <- c(1, 2, 3, 6, 5)
cor.test(x, y, method = "spearman")
Spearman's rank correlation rho
data: x and y
S = 2, p-value = 0.08333
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9
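As a quick sanity check, Spearman’s \(\rho_s\) is just the Pearson correlation of the ranks (no ties in this data):

```r
# Pearson correlation of the ranks reproduces rho = 0.9
cor(rank(x), rank(y))
```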
Linear regression - determine if a linear relationship exists
slope \(\beta_1\): \[y=\beta_0+\beta_1 x+\epsilon\quad\text{where}\:\begin{cases}
\beta_0\:\text{is the y-intercept} \\
\beta_1\:\text{is the gradient} \\
\epsilon\:\text{is the random error}
\end{cases}\]
test statistic: \[t=\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\quad\text{where}\:\begin{cases}
\hat{\beta}_1\:\text{is the estimated slope for the sample}\\
SE(\hat{\beta}_1)\:\text{is the standard error of the slope}\\
\text{the t-statistic is compared against t-distribution with }n−2\text{ dof}
\end{cases}\]
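The summary output below appears to come from fitting the same `x` and `y` used in the Pearson example above (a sketch of the call, since the original chunk isn’t shown):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
summary(lm(y ~ x))   # t-test on the slope; its p-value matches the correlation test
```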
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5
-0.8 0.6 1.0 -0.6 -0.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2000 0.9381 2.345 0.101
x 0.6000 0.2828 2.121 0.124
Residual standard error: 0.8944 on 3 degrees of freedom
Multiple R-squared: 0.6, Adjusted R-squared: 0.4667
F-statistic: 4.5 on 1 and 3 DF, p-value: 0.124
The p-value \(p = 0.12\) is above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis - there is no significant evidence of a linear relationship.
One Way ANOVA (Analysis of Variance) - Compare means of 3 or more groups to check if any come from different populations
test statistic: \[F=\frac{\text{between group variability}}{\text{within group variability}}=\frac{MS_{between}}{MS_{within}}\quad\text{where}\:\begin{cases}
MS_{between}=\frac{SS_{between}}{df_{between}}=\frac{\text{sum of squares}}{\text{degrees of freedom}} \\
df_{between}=k-1\\
\\
MS_{within}=\frac{SS_{within}}{df_{within}}=\frac{\text{sum of squares}}{\text{degrees of freedom}} \\
df_{within}=N-k\\
\\
k\:\text{is the number of groups} \\
N\:\text{is the total number of observations}
\end{cases}\]
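The ANOVA table below appears to come from a call of this form (a sketch; `scores` and `group` are the author’s data, which isn’t reproduced here):

```r
# one-way ANOVA: scores is a numeric response, group is a 3-level factor (12 observations)
summary(aov(scores ~ group))
```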
Df Sum Sq Mean Sq F value Pr(>F)
group 2 66.70 33.35 3.651 0.069 .
Residuals 9 82.22 9.14
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value \(p = 0.069\) is above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis at the 5% level - the evidence that any group mean differs is only suggestive. Had the test been significant, a Tukey HSD test would show which means differ (see the sketch after the boxplot below).
boxplot(scores ~ group, col = "skyblue", main = "Group Comparison via Boxplot",
        xlab = "Group", ylab = "Scores")
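If the ANOVA had rejected \(H_0\), a Tukey HSD post-hoc comparison on the same fit (a sketch, using the same assumed `scores` and `group` objects) would show which pairs of means differ while controlling the family-wise error rate:

```r
fit <- aov(scores ~ group)
TukeyHSD(fit)   # pairwise group comparisons with adjusted p-values
```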
TODO: - Kruskal-Wallis test Non-parametric version of ANOVA
Hypothesis Dredging Avoidance
The Bonferroni correction is a method used to control the family-wise error rate (FWER) when performing multiple hypothesis tests. Given \(m\) independent tests and a desired overall significance level \(\alpha\), the corrected significance level for each individual test is:
\[
\alpha' = \frac{\alpha}{m}
\]
A hypothesis test is considered statistically significant only if its p-value satisfies \(p < \alpha'\). This approach is straightforward and conservative, reducing the likelihood of Type I errors, but potentially increasing the risk of Type II errors when the number of tests is large.
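A minimal sketch with made-up p-values, showing both the manual correction and R’s `p.adjust`:

```r
pvals <- c(0.001, 0.012, 0.030, 0.200)         # hypothetical raw p-values from m = 4 tests
alpha <- 0.05
pvals < alpha / length(pvals)                  # manual Bonferroni: compare to alpha/m
p.adjust(pvals, method = "bonferroni") < alpha # equivalent: adjust p-values, compare to alpha
```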
Starting with the definition of covariance and using the linearity of expectation\[
\begin{align}
\operatorname{Cov}(X,Y)\equiv&\mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right]\\
=&\mathbb{E}\left[XY - X\mathbb{E}[Y] - \mathbb{E}[X]Y + \mathbb{E}[X]\mathbb{E}[Y]\right]\\
=&\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[X]\mathbb{E}[Y] + \mathbb{E}[X]\mathbb{E}[Y]\qquad\text{(linearity of expectation)}\\
=&\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]
\end{align}
\]
Welch test statistic:
The Welch t-test statistic is given by: \[\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \sim \mathcal{t}_{\nu}\qquad\text{where}\quad\nu\quad\text{is the (generally fractional) Welch–Satterthwaite degrees of freedom derived below}\]
\[
\begin{align}
&\text{data drawn from two }\textbf{populations}\text{ with matching mean (under }H_0\text{) but differing variances}:\\
&\qquad X_1\sim \mathcal{N}\left(\mu,\sigma_1^2\right)
\quad\text{and}\quad X_2\sim \mathcal{N}\left(\mu,\sigma_2^2\right)\\
&\qquad\bar{X}_1\sim \mathcal{N}\left(\mu,\sigma_1^2/n_1\right)\quad\text{and}\quad\bar{X}_2\sim \mathcal{N}\left(\mu,\sigma_2^2/n_2\right)\quad\text{(from variance of expectation)}\\
&\qquad\frac{\bar{X}_1-\hat{\mu}}{\hat{\sigma}_1/\sqrt{n_1}}\sim \mathcal{t}_{n_1-1}
\quad\text{and}\quad\frac{\bar{X}_2-\hat{\mu}}{\hat{\sigma}_2/\sqrt{n_2}}\sim \mathcal{t}_{n_2-1}\qquad\text{(standardised sampling distributions)}\\
\end{align}
\]
\[
\begin{align}
&\text{sampling distribution of the mean difference estimator has a similar form,}\\
&\qquad\mathbb{E}[\bar{X}_1-\bar{X}_2]=\mathbb{E}[\bar{X}_1]-\mathbb{E}[\bar{X}_2]=0\qquad\text{(linearity of expectation)}\\
&\qquad\operatorname{Var}[\bar{X}_1-\bar{X}_2]=\operatorname{Var}[\bar{X}_1]+\operatorname{Var}[\bar{X}_2]=\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\qquad\text{(variance of independent sum)}\\
&\quad\therefore\quad\bar{X}_1-\bar{X}_2\sim\mathcal{N}\left(0,\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\right)\qquad\text{(sampling distribution, known variances)}\\
&\text{the two variance estimates are coupled - so in estimating both we }\textbf{lose a fractional dof}\\
&\quad\therefore\quad\boxed{\frac{\bar{X}_1-\bar{X}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\sim\mathcal{t}_{\nu}}\qquad\text{(Welch t-test statistic)}
\end{align}
\]
degrees of freedom:
\[
\begin{align}
&\text{test statistic is a function of sampling distributions for sample variance of the two groups...}\\
&\qquad\boxed{
\begin{aligned}
&\text{given i.i.d. data}\quad X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\quad:\\
&\;s^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2\quad\text{(sample variance)}\\
&\;\frac{(n-1)s^2}{\sigma^2}\sim\chi^2_{n-1}\quad\text{(sample variance sampling distribution)}
\end{aligned}
}\\\\
&\quad t_\nu\sim\frac{(\bar{X}-\bar{Y})-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\qquad\text{(standardised test statistic)}
\\\\
&\quad\text{where}\quad s_1^2\sim\frac{\sigma_1^2}{n_1-1}\chi^2_{n_1-1}\qquad\text{and}\qquad s_2^2\sim\frac{\sigma_2^2}{n_2-1}\chi^2_{n_2-1}
\end{align}
\] using:
\[
\begin{align}
\text{let's now look at the}\quad&\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\quad\text{term, noting}\quad\mathbb{E}[\chi^2_\nu]=\nu\quad\text{and}\quad\operatorname{Var}(\chi^2_\nu)=2\nu:\\
\operatorname{Var}\left(\frac{s_1^2}{n_1}\right)&=\operatorname{Var} \left( \frac{\sigma_1^2}{n_1} \cdot \frac{\chi^2_{n_1 - 1}}{n_1 - 1} \right)\\
&=\underbrace{\left(\frac{\sigma_1^2}{n_1}\cdot\frac{1}{n_1-1}\right)^2}_{\text{scaling of variance}}\cdot\operatorname{Var}(\chi^2_{n_1-1})\\
&=\frac{\sigma_1^4}{n_1^2}\cdot\frac{1}{(n_1-1)^2}\cdot2\:\cdot(n_1-1)\\\\
\text{so}\quad\operatorname{Var}\left(\frac{s_1^2}{n_1}\right)=\frac{\sigma_1^4}{n_1^2}&\cdot\frac{2}{n_1-1}\qquad\text{and}\qquad\operatorname{Var}\left(\frac{s_2^2}{n_2}\right)=\frac{\sigma_2^4}{n_2^2}\cdot\frac{2}{n_2-1}\\\\
\therefore\qquad\operatorname{Var}\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)&\quad=\quad\frac{\sigma_1^4}{n_1^2}\cdot\frac{2}{n_1-1}\quad+\quad\frac{\sigma_2^4}{n_2^2}\cdot\frac{2}{n_2-1}
\end{align}
\]\[
\begin{align}
&\text{remember the Student's t-distribution is a function of the z and chi-squared distribution...}\\
&\qquad\qquad\boxed{t_\nu\sim\frac{Z}{\sqrt{V/\nu}}\qquad\text{where}\quad Z\sim N(0,1)\quad\text{and}\quad V\sim\chi^2_\nu}\\
&\text{we previously calculated the test statistic}\\
&\qquad\qquad\boxed{\frac{\bar{X}_1-\bar{X}_2}{\sqrt{\frac{\hat{\sigma}_1^2}{n_1}+\frac{\hat{\sigma}_2^2}{n_2}}}\sim\mathcal{t}_{\nu}}\qquad\text{(Welch test statistic)}\\
&\text{under }H_0\text{ the means follows a normal distribution with zero mean}\\
&\:\text{notice numerator uses population parameters and denominator uses estimates...}\\
&\qquad\qquad\frac{N(0,\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2})}{\sqrt{\frac{\hat{\sigma}_1^2}{n_1}+\frac{\hat{\sigma}_2^2}{n_2}}}\sim\mathcal{t}_{\nu}\quad\text{or}\quad\frac{Z\cdot\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}{\sqrt{\frac{\hat{\sigma}_1^2}{n_1}+\frac{\hat{\sigma}_2^2}{n_2}}}=\frac{Z}{\sqrt{\hat{\sigma}^2/\sigma^2}}\sim\mathcal{t}_{\nu}\\
&\text{now we can find the degrees of freedom solving for}\:\nu:\\
&\qquad\qquad\frac{\hat{\sigma}_1^2}{n_1}+\frac{\hat{\sigma}_2^2}{n_2}\approx\frac{\sigma^2}{\nu}\chi^2_\nu\qquad\text{Satterthwaite approximation}\\
&\nu\text{ is found by matching the first and second moments of both sides!}
\end{align}
\]
welch satterthwaite approximation:
We want to estimate the degrees of freedom for a statistic of the form:
\[
V = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}
\]
where each \(s_i^2\) is the sample variance from a normal distribution with unknown population variance \(\sigma_i^2\).
We approximate \(V\) by a scaled chi-squared distribution:
\[
V \approx \frac{\sigma^2}{\nu} \chi^2_\nu
\qquad \text{(Satterthwaite approximation)}
\]
Step 1: Match Moments
The goal is to approximate the distribution of \(V\) using a scaled chi-squared variable. For a chi-squared distribution:
\(\mathbb{E}[\chi^2_\nu] = \nu\)
\(\text{Var}(\chi^2_\nu) = 2\nu\)
So for \(V \approx \frac{\sigma^2}{\nu} \chi^2_\nu\), we have:
\(\mathbb{E}[V] = \sigma^2\)
\(\text{Var}(V) = \frac{2\sigma^4}{\nu}\)
Step 2: Compute Moments from Components
Since \(V = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\), we compute:
\[
\begin{align}
&\qquad\mathbb{E}[V]=\mathbb{E}\left[\frac{s_1^2}{n_1}\right]+\mathbb{E}\left[\frac{s_2^2}{n_2}\right]=\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\qquad\text{(sample variances are unbiased)}\\
&\qquad\operatorname{Var}(V)=\operatorname{Var}\left(\frac{s_1^2}{n_1}\right)+\operatorname{Var}\left(\frac{s_2^2}{n_2}\right)=\frac{2\sigma_1^4}{n_1^2(n_1-1)}+\frac{2\sigma_2^4}{n_2^2(n_2-1)}\qquad\text{(computed above)}\\\\
&\text{equating this with }\operatorname{Var}(V)=\frac{2\sigma^4}{\nu}\text{, where }\sigma^2=\mathbb{E}[V]\text{, and solving for }\nu:\\
&\qquad\nu=\frac{\left(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\right)^2}{\frac{\sigma_1^4}{n_1^2(n_1-1)}+\frac{\sigma_2^4}{n_2^2(n_2-1)}}\\\\
&\text{replacing the unknown }\sigma_i^2\text{ with the sample variances gives the Welch-Satterthwaite degrees of freedom:}\\
&\qquad\boxed{\nu\approx\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)}+\frac{s_2^4}{n_2^2(n_2-1)}}}
\end{align}
\]
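A quick numeric check of this formula against R’s own Welch calculation (made-up normal samples; `t.test` reports the Welch–Satterthwaite df as `parameter`):

```r
set.seed(42)
x <- rnorm(15, sd = 4.2)
y <- rnorm(12, sd = 3.1)
v1 <- var(x)/length(x); v2 <- var(y)/length(y)
nu <- (v1 + v2)^2 / (v1^2/(length(x) - 1) + v2^2/(length(y) - 1))
c(manual = nu, t.test = unname(t.test(x, y)$parameter))  # the two should agree
```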
Footnotes
Ronald A. Fisher introduced the idea of hypothesis testing in the early 20th century, using the term “significance” to describe whether an observed result was strong enough to warrant rejecting a null hypothesis. The significance level (\(\alpha = 5\%\)) sets the threshold for how much evidence is needed before rejecting the null hypothesis. A high significance level corresponds to a low threshold for rejecting the null hypothesis and vice versa.↩︎
Power is the ability of a test to detect a true effect (\(\approx 80\%\)) - the opposite of a Type II error (\(\beta\)), which is failing to detect an effect when one exists.↩︎