NHST: The frequentist treats the hypothesis as fixed — it’s either true or false. The sampling distribution tells us the likelihood of obtaining data like ours (or more extreme) if the hypothesis were true. We don’t attach a probability to the truth of the hypothesis itself.
Statistical processes produce samples from an underlying probability mass function or probability density function, and frequentists model real-world phenomena as if they arise from such processes. In practice, we don’t observe these full distributions directly — instead, we infer them from randomly drawn samples. This is the foundation of statistical inference - a practice which does not fully align with the scientific method. One core method in this framework is null hypothesis significance testing (NHST), which provides a structured way to test whether the distribution we suspect is driving the data could be wrong. It allows us to make probabilistic statements about how surprising our observed data would be if the null hypothesis were true — thereby quantifying the evidence against that hypothesis.
NHST is often criticised for primarily detecting large, obvious effects and for its limitations in assessing nuanced hypotheses. However, when thoughtfully designed with adequate power, precise hypotheses, and appropriate controls, NHST can be a robust tool for rigorous scientific inference, shifting its role from merely flagging extreme failures to genuinely testing theory-driven predictions.
NHST provides a quantitative framework/justification for determining whether a hypothesis can still stand in the face of data (e.g. is a coin considered fair given a skewed distribution of heads and tails). It is a frequentist technique based solely on the likelihood, with the choice of null/alternative hypothesis being subjective (much like the priors of Bayesian statistics). The Neyman-Pearson approach helps determine when the null hypothesis should be rejected based on the significance of the test. This means deciding how much risk we’re willing to accept of incorrectly rejecting the null hypothesis when the observed data is merely unlikely under it.
Comparison of a High Power and a Low Power Test
Note
All computations in frequentist statistics involve the likelihood function \(P(x|\theta)\), which plays a central role in parameter estimation and hypothesis testing. In frequentist inference, we do not assign probabilities to hypotheses. Specifically, \(P(H_0)\) and \(P(H_0|x)\) are undefined; instead, inference is made by assuming \(H_0\) is true and evaluating how likely the observed data \(x\) is under this assumption: \[\boxed{\text{If } P(x|H_0)\text{ is small} \Rightarrow \text{Reject } H_0}\quad\text{(frequentist logic)}\] The likelihood function \(\mathcal{L}(\theta|x) = P(x|\theta)\) is not a probability distribution over parameters, but a function describing how plausible different values of \(\theta\) are given fixed observed data \(x\). Specifically:
Low likelihood under \(H_0\) - likely small p-value - high chance of rejection
High likelihood under \(H_0\) - likely large p-value - low chance of rejection
Contrast this with the Bayesian expression in which probabilities are assigned to the prior and posterior probabilities: \[\boxed{P(H_0|x) = \frac{P(x|H_0)P(H_0)}{P(x)}}\quad\text{(Bayes' theorem — not used in frequentist inference)}\]
| term | notation | description |
|---|---|---|
| test statistic | \(T\) | random variable with probability distribution given by the null distribution |
| null distribution | \(P(T\mid H_0)\) | the PDF/PMF of the test statistic \(T\) under \(H_0\) |
| p-value | \(p=1-F(t_{obs}\mid H_0)\) | the cumulative “tail” probability of the test statistic distribution given \(H_0\): the probability of obtaining a test statistic as extreme as or more extreme than the observed value |
| critical value/threshold | \(\{z_\alpha, t_\alpha,\dots\}\) | value of the test statistic that defines the rejection threshold for \(H_0\) |
A simple hypothesis completely specifies the probability distribution of the population using fixed/hypothesised parameter values e.g. \(H:\:X\sim\text{Norm}(0,1)\). For the example coin experiment the null hypothesis \(H_0:\:X\sim \text{Bin}(10,0.5)\) is a simple hypothesis.
A composite hypothesis specifies the probability distribution of the population using unknown/flexible parameter values, e.g. \(H:\:X\sim\text{Norm}(\mu,\sigma)\); since the parameters \(\mu\) and \(\sigma\) are free, this is really a family of hypotheses rather than a single one. For the example coin experiment the alternative hypothesis \(H_1:\:X\sim \text{Bin}(10,p)\quad\text{where}\:p\ne 0.5\) is a composite hypothesis since \(p\) has a range of possible values.
Power: The Actual Test Statistic Distribution
Somewhat paradoxically, we need the actual test statistic distribution to compute the power of an experiment. This is solved by estimating the distribution of the test statistic under the alternative hypothesis and computing the power against that estimate.
| estimation method | use case |
|---|---|
| Effect size estimates | Prior research or pilot studies available |
| Normal/t-distribution approximation | Large samples or known distributions |
| Bootstrapping | Small samples, unknown distributions |
| Monte Carlo simulation | Complex models, no closed formula |
| Nonparametric methods | No assumptions about distribution |
Typically we can increase the power of a test by increasing the amount of data and thereby decreasing the variance of the null and alternative distributions. In experimental design it is important to determine ahead of time the number of trials or subjects needed to achieve a desired power.
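As a concrete illustration, base R’s `power.t.test` does this kind of pre-experiment sizing; the effect size, standard deviation and target power below are made-up values for the sketch, not from any particular study:

```r
# Required n per group to detect a mean difference of 0.5 (sd = 1)
# with 80% power at the 5% significance level -- illustrative values only
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
```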
Significance (\(\alpha\)-value)
For a composite \(H_0\), the member distribution that yields the highest significance (the worst case) is the distribution we use to set the significance level.
Significance analogy in Legal Trials. In criminal law, the standard of proof is “beyond a reasonable doubt,” which is designed to minimise Type I errors (wrongfully convicting an innocent person). This is equivalent to setting a very low significance level (\(\alpha\)), requiring very strong evidence before rejecting the presumption of innocence \(H_0\). However, this does not mean that Type I errors (wrongful convictions) are impossible — only that the legal system is structured to minimise them. The trade-off is that reducing Type I errors increases the probability of Type II errors (failing to convict a guilty person), just as lowering \(\alpha\) in statistics increases the chance of failing to reject a false null hypothesis. The bottom line: the legal system implicitly acknowledges that some wrongful convictions will occur, which is why appeals exist to correct them. Whenever we prosecute, there is always a chance of error.
Controlling Power and Significance
%%{init: {'theme':'neutral', 'fontFamily':'Arial', 'flowchart': {'useMaxWidth':false}}}%%
flowchart LR
A["α Increase"] --> B["Power (1-β)"]
A --> C["Type I Error Increase"]
D["Sample Size Increase"] --> B
E["Effect Size Increase"] --> B
F["Variability Reduction (data std.dev.)"] --> B
linkStyle 0,1 stroke:#ff6b6b,stroke-width:2px
linkStyle 2,3,4 stroke:#4dabf7,stroke-width:2px
class A,C critical;
class B success;
classDef critical fill:#ff8787,stroke:#fa5252
classDef success fill:#69db7c,stroke:#2b8a3e
Tip
Diagram Legend
Red Nodes: Significance threshold (\(\alpha\)) and Type I Error
Green Node: Statistical power (1-\(\beta\))
Red Arrows: Trade-off (higher \(\alpha\) - more power but more false positives)
Blue Arrows: Power-boosting factors
\(p\)-value - the heart of NHST tests
The \(p\)-value is calculated by looking up the test statistic (or even just the raw random variable value) against whatever CDF models the null distribution. The \(p\)-value is the probability, assuming the null hypothesis \(H_0\), of observing a result at least as extreme as the one we got. If the \(p\)-value is less than the significance level \(\alpha\) then we reject \(H_0\). Otherwise we fail to reject \(H_0\).
| Test Type | Alternative Hypothesis | Area Considered | Summary |
|---|---|---|---|
| Two-tailed | \(H_1:\mu\ne\mu_0\) | Both tails; p-value is the smaller tail probability (left or right) \(\times2\) | Surprise in either direction |
| One-tailed (left) | \(H_1:\mu<\mu_0\) | Left tail only; p-value is p<dist>(...) i.e. p<dist>(lower.tail = TRUE, ...) | Surprise only if result is too small |
| One-tailed (right) | \(H_1:\mu>\mu_0\) | Right tail only; p-value is 1-p<dist>(...) or p<dist>(lower.tail = FALSE, ...) | Surprise only if result is too large |
one-tailed (right) example
IQ scores are designed to follow a normal distribution \(IQ\sim N(100,15^2)\) with mean \(\mu=100\) and standard deviation \(\sigma=15\). This guy claims to have an IQ of \(173\). How extreme is an IQ of \(173\) under \(N(100,15^2)\)?
\[z=\frac{173-100}{15}=\frac{73}{15}\approx4.87\]
# Probability of getting an IQ ≥ 173 in a standard population
1 - pnorm(173, mean = 100, sd = 15)
[1] 5.674811e-07
pnorm(173, mean =100, sd =15, lower.tail =FALSE)
[1] 5.674811e-07
pnorm(73/15, mean =0, sd =1, lower.tail =FALSE)
[1] 5.674811e-07
Probability of randomly finding someone with an IQ that high: about \(0.00006\%\) \((5.7\times10^{-7})\). I reject the null hypothesis that this is a true claim.
Common Tests/Test Statistic…
Comparing Means
Student’s t-test - compare means of two unknown sampled normal distributions
The t-test tests the similarity between two groups of samples (each group being drawn from a normally distributed population), using the t-statistic, a random variable which follows the Student’s t-distribution. The statistic is low for closely matching groups and high for distinct groups: it grows with the difference between the group means (numerator) and shrinks with the variability within the groups (denominator).
The t-test calculates the p-value using a t-statistic against the Student’s t-distribution. This is equivalent to calculating the p-value from the appropriate normal distribution; however, here we don’t actually know \(\mu\) and \(\sigma\) for that distribution, so we fall back to the t-distribution.
\[t=\frac{\text{signal}}{\text{noise}}=\frac{\text{difference between group means}}{\text{standard error of the difference}}=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\]
Uses the t-distribution \(t_\nu\) with degrees of freedom \(\nu\) based on the sample size to determine the probability of observing a scenario at least as extreme as the one observed.
used when we’re comparing two sample means
the populations are normally distributed (typically a sum of random variables in accordance with the CLT)
the population variance is unknown (so you estimate it from the sample)
There are a few variations of the t-test:
Consider the mtcars dataset. Construct a 95% t interval for MPG comparing 4- to 6-cylinder cars (subtracting in the order 4 − 6), assuming constant variance.
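A sketch of one way to do this in R, using `t.test` with `var.equal = TRUE` as the exercise asks (the vector names are mine):

```r
# 95% pooled-variance t interval for mpg: 4-cylinder minus 6-cylinder cars
data(mtcars)
mpg4 <- mtcars$mpg[mtcars$cyl == 4]
mpg6 <- mtcars$mpg[mtcars$cyl == 6]
t.test(mpg4, mpg6, var.equal = TRUE)$conf.int
```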
One-sample t-test
Test whether the estimated mean of normally distributed random sample data differs from a known or hypothesised population mean.
test statistic: \[t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}\quad\text{where}\:\begin{cases}
s\:\text{is the sample standard deviation} \\
\mu_0\:\text{is the hypothesised population mean}\\
\bar{x}\:\text{is the sample mean}\\
n\:\text{is the sample size}\\
\text{the t-statistic is compared against t-distribution with }n−1\text{ dof}
\end{cases}\]
Here we are getting the p-value for the hypothesised population mean, whose cumulative probability distribution corresponds to the sampling distribution of the mean estimator for unknown \(\sigma\), i.e. how extreme is the observed data under the null hypothesis?
Is the average miles per gallon of cars equal to \(21\) mpg?
Here I’ll use the mtcars dataset assuming the cars were randomly sampled from the population (it isn’t btw, it’s actually a convenience sample of specific car models from the early ’70s)
data(mtcars)
# Random sample of mileage measurements
mpg_values <- mtcars$mpg
# Calculation breakdown (using Student's t-distribution)
n <- length(mpg_values)      # no. samples
df <- n - 1                  # degrees of freedom
x_bar <- mean(mpg_values)    # sample mean
s <- sd(mpg_values)          # sample std.dev
mu_0 <- 21                   # hypothesised population mean
t_stat <- (x_bar - mu_0)/(s/sqrt(n))     # t-statistic
p_value <- 2*pt(-abs(t_stat), df = df)   # two-tailed p-value
print(t_stat)
[1] -0.8535335
print(p_value) # ~40% chance of seeing these values under the 21 mpg hypothesis
[1] 0.3999109
The probability of observing a test statistic at least as extreme as \(t=-0.85\), assuming the null hypothesis is true, is \(\approx 40\%\). This is far more than \(5\%\), so the data is not at all extreme under \(H_0\) and we fail to reject the null hypothesis.
Here’s the quick practical way to do it…
t.test(mpg_values, mu =21)
One Sample t-test
data: mpg_values
t = -0.85353, df = 31, p-value = 0.3999
alternative hypothesis: true mean is not equal to 21
95 percent confidence interval:
17.91768 22.26357
sample estimates:
mean of x
20.09062
I hate the way R outputs the alternative hypothesis to make it look like we reject the null hypothesis - we don’t!
Two-sample pooled t-test (matching variance)
Compare the means of two independent normally distributed populations from randomly sampled data with equal variance.
test statistic: \[t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_p^2}{n_1}+\frac{s_p^2}{n_2}}}\quad\text{where}\:\begin{cases}
s_p\:\text{is the pooled sample standard deviation} \\
\bar{x}_1,\bar{x}_2\:\text{are the respective sample means}\\
n_1,n_2\:\text{are the respective sample sizes}\\
\text{the t-statistic is compared against t-distribution with }n_1+n_2−2\text{ dof}
\end{cases}\]
pooled standard deviation: \[s_p=\sqrt{\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2}}\]
null distribution: Student’s t-distribution
This test compares the means of two independent sets of samples, assuming the two populations are normally distributed with the same variance and different mean i.e. \(X_1\sim N(\mu_1,\sigma)\) and \(X_2\sim N(\mu_2,\sigma)\). The test is called “pooled” because it combines (or pools) the sample variances to estimate a common variance.
Use the pooled t-test only if the normality assumption is met (or \(n\) is large, by the CLT) and the variances are equal (check with an F-test or Levene’s test). Otherwise, use Welch’s t-test, which doesn’t assume equal variances.
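In R the choice is just a flag on `t.test` (a sketch; `x` and `y` stand in for the two samples):

```r
t.test(x, y, var.equal = TRUE)   # pooled t-test (assumes equal variances)
t.test(x, y)                     # Welch's t-test (the default, var.equal = FALSE)
var.test(x, y)                   # F-test for equality of variances (normality-sensitive)
```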
pooled t-test statistic:
The pooled t-test statistic is given by: \[\frac{\bar{X}_1 - \bar{X}_2}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim \mathcal{t}_{n_1+n_2-2}\qquad\text{where}\quad\sigma^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\]
\[
\begin{align}
&\text{data drawn from two }\textbf{populations}\text{ with differing mean and matching variance}:\\
&\qquad X_1\sim \mathcal{N}\left(\mu_1,\sigma^2\right)
\quad\text{and}\quad X_2\sim \mathcal{N}\left(\mu_2,\sigma^2\right)\\
\\
&\text{...but the variances of the sample means won't match, even under }H_0\text{, if the sample sizes differ...}\\
&\qquad\bar{X}_1\sim \mathcal{N}\left(\mu_1,\sigma^2/n_1\right)\quad\text{and}\quad\bar{X}_2\sim \mathcal{N}\left(\mu_2,\sigma^2/n_2\right)\quad\text{(variance of the sample mean)}\\
&\qquad\frac{\bar{X}_1-\hat{\mu}_1}{\hat{\sigma}/\sqrt{n_1}}\sim \mathcal{t}_{n_1-1}
\quad\text{and}\quad\frac{\bar{X}_2-\hat{\mu}_2}{\hat{\sigma}/\sqrt{n_2}}\sim \mathcal{t}_{n_2-1}\qquad\text{(standardised)}\\
\end{align}
\]
\[
\begin{align}
&\text{sampling distribution of the mean difference estimator has a similar form,}\\
&\qquad\mathbb{E}[\bar{X}_1-\bar{X}_2]=\mathbb{E}[\bar{X}_1]-\mathbb{E}[\bar{X}_2]=\mu'\qquad\text{(linearity of expectation)}\\
&\qquad\operatorname{Var}[\bar{X}_1-\bar{X}_2]=\operatorname{Var}[\bar{X}_1]+\operatorname{Var}[\bar{X}_2]=\frac{\sigma^2}{n_1}+\frac{\sigma^2}{n_2}={\sigma'}^2\qquad\text{(variance of an independent sum)}\\
&\quad\therefore\quad\bar{X}_1-\bar{X}_2\sim\mathcal{N}\left(\mu',{\sigma'}^2\right)\qquad\text{(exact, if }\sigma^2\text{ were known)}\\
&\quad\therefore\quad\frac{(\bar{X}_1-\bar{X}_2)-\mu'}{\hat{\sigma}'}\sim\mathcal{t}_{n_1+n_2-2}\qquad\text{(sampling distribution once }\sigma'\text{ is replaced by its pooled estimate; two dof are lost for the two estimated means)}
\end{align}
\]
\[
\begin{align}
&\:\text{common variance estimated by the }\textbf{pooled}\text{ sample variance,}\\
&\;\text{this is just the weighted average of two sample variances...}\\\\
&\qquad\frac{\left(\bar{X}_1 - \bar{X}_2\right)-\hat{\mu}'}{\hat{\sigma}'\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim \mathcal{t}_{n_1+n_2-2}\quad\text{where}\quad\hat{\sigma}'^2=\frac{(n_1-1)\hat{\sigma}_1^2+(n_2-1)\hat{\sigma}_2^2}{n_1+n_2-2}\\\\
&\text{under }H_0:\hat{\mu}_1=\hat{\mu}_2\text{, and linearity of expectation: }\hat{\mu}'=0:\\\\
&\qquad\frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma}'\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \sim \mathcal{t}_{n_1+n_2-2}\\\\
&\text{we can express the denominator more simply as...}\\\\
&\qquad\frac{\bar{X}_1 - \bar{X}_2}{\operatorname{SE}} \sim \mathcal{t}_{n_1+n_2-2}\qquad\text{where}\quad\operatorname{SE} = \sqrt{ \hat{\sigma}'^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) }
\end{align}
\]
toy example: does automatic transmission affect fuel economy?
with(mtcars, t.test(mpg[am ==0], mpg[am ==1]))
Welch Two Sample t-test
data: mpg[am == 0] and mpg[am == 1]
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean of x mean of y
17.14737 24.39231
# ...or...
t.test(mpg ~ am, data = mtcars)
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
older automatics generally used more fuel than manual transmissions due to factors like the torque converter and fewer available gears. Modern automatics often have more gears and improved technology, leading to fuel economy figures comparable to or even better than manuals.
Two-sample Welch’s t-test (different variance)
Compare the means of two independent normally distributed populations from randomly sampled data with different variances (more common, and safer than the pooled t-test, in practice).
test statistic: \[t=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\quad\text{where}\:\begin{cases}
s_1,s_2\:\text{are the respective sample standard deviations} \\
\bar{x}_1,\bar{x}_2\:\text{are the respective sample means}\\
n_1,n_2\:\text{are the respective sample sizes}\\
\text{the t-statistic is compared against t-distribution with (typically) fractional dof (see below...)}
\end{cases}\]
Paired t-test Used for mean comparison on matched samples (e.g., before/after measurements or repeated measures on the same subjects). Tests whether the mean of the differences between paired observations is zero.
test statistic: \[t = \frac{\bar{d}}{s_d / \sqrt{n}}\quad\text{where}\:\begin{cases}
d_i=x_i-y_i\:\text{is the paired difference} \\
s_d\:\text{is the standard deviation of the differences} \\
\bar{d}\:\text{is the mean of the differences}\\
n\:\text{number of pairs}\\
\text{the t-statistic is compared against the }n-1\text{ t-distribution}
\end{cases}\]
standard deviation of the differences: \[s_d = \sqrt{ \frac{1}{n - 1} \sum_{i=1}^{n} (d_i - \bar{d})^2 }\]
This is simply reducing the paired problem to a one-sample t-test on the differences, testing: \[H_0: \mu_d = 0 \quad \text{vs.} \quad H_1: \mu_d \neq 0\]
Do two soporific drugs differ in their effect (increase in hours of sleep compared to control) on the same 10 patients? Three populations: control, drug 1 and drug 2…
## Paired t-test
## The sleep data is actually paired, so could have been in wide format:
sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group")
## Traditional interface
t.test(sleep2$extra.1, sleep2$extra.2, paired = TRUE)
Paired t-test
data: sleep2$extra.1 and sleep2$extra.2
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-2.4598858 -0.7001142
sample estimates:
mean difference
-1.58
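Since the paired test is just a one-sample t-test on the differences, the following check (reusing the `sleep2` object created above) reproduces the same t, df and p-value:

```r
# one-sample t-test on the paired differences -- identical to the paired test above
t.test(sleep2$extra.1 - sleep2$extra.2, mu = 0)
```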
z-test - compare means of two known sampled normal distributions
Uses the standard normal distribution \(N(0,1)\) to determine the probability of observing a scenario at least as extreme as the one observed.
used when we’re comparing means (sample vs. population or two samples)
the populations are normally distributed (typically a sum of random variables in accordance with the CLT)
the population variance is known (very rarely the case, although if the sample size is large enough, \(n\ge30\), we can approximate the t-test with a z-test)
One-sample z-test Test whether the mean of a sample from a single normally distributed population differs from a known or hypothesised population mean.
test statistic: \[z=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\quad\text{where}\:\begin{cases}
\sigma\:\text{is the population standard deviation} \\
\mu\:\text{is the population mean}\\
\bar{x}\:\text{is the sample mean}\\
n\:\text{is the sample size}\\
\text{the z-statistic is compared against the standard normal distribution }N(0,1)
\end{cases}\]
A factory produces light bulbs, and the manufacturer claims that the average lifespan of a light bulb is 1,000 hours. A sample of 50 light bulbs from a new batch has a sample mean lifespan of 1,020 hours. The population standard deviation is known to be 100 hours. Does this sample show a significant difference from the claimed mean?
Null hypothesis \(H_0: \mu=1000\) (the population mean lifespan is 1000 hours).
Alternative hypothesis \(H_A: \mu\ne1000\) (the population mean lifespan is not 1000 hours).
\[z=\frac{1020-1000}{100/\sqrt{50}}\approx1.41\]
z <- (1020 - 1000)/(100/sqrt(50))
pnorm(z, 0, 1, lower.tail = TRUE)  # left 0.92 so this value lies to the right of the distribution
[1] 0.9213504
pnorm(z,0,1,lower.tail =FALSE) # right 0.08 so this is quite extreme
[1] 0.0786496
p <- 2*pnorm(z, 0, 1, lower.tail = FALSE)  # two-tailed p-value ~ 0.16, exceeds 0.05
print(p)
[1] 0.1572992
The two-tailed p-value \(p \approx 0.16\) exceeds the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis; there is no significant evidence to dispute the manufacturer’s claim.
Two-Sample Z-test (Independent Z-test) Test whether the means of two independent samples differ, when the populations are normally distributed and their variances are known.
test statistic: \[z=\frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}\quad\text{where}\:\begin{cases}
\sigma_1,\sigma_2\:\text{are the known population standard deviations} \\
\bar{x}_1,\bar{x}_2\:\text{are the respective sample means}\\
n_1,n_2\:\text{are the respective sample sizes}\\
\text{the z-statistic is compared against the standard normal distribution}
\end{cases}\]
null distribution: Standard Normal Distribution \(z\sim N(0,1)\)
You want to compare the average test scores of students from two schools. School A has 40 students with a mean score of 85, and School B has 50 students with a mean score of 80. The population standard deviations for both schools are known?!: School A (\(\sigma_A=10\)) and School B (\(\sigma_B=12\)).
Null hypothesis \(H_0:\mu_A=\mu_B\) (the average scores from both schools are equal)
Alternative hypothesis \(H_A:\mu_A\ne\mu_B\) (the average scores from both schools are different)
z <- (85 - 80)/sqrt((10^2/40) + (12^2/50))
pnorm(z, 0, 1, lower.tail = TRUE)  # left 0.98 so the value lies to the right of the distribution
[1] 0.9844446
pnorm(z,0,1,lower.tail =FALSE) # right 0.015 so this is very extreme
[1] 0.01555538
p <- 2*pnorm(z, 0, 1, lower.tail = FALSE)
print(p)
[1] 0.03111077
The p-value \(p \approx 0.03\) is below the significance level \(\alpha = 0.05\), so we reject the null hypothesis; there is significant evidence that the schools are performing differently.
Z-test for Proportions (One-Sample Proportion Z-test) Compares a sample proportion to a known population proportion.
test statistic: \[z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}\quad\text{where}\:\begin{cases}
\hat{p}\:\text{is the sample proportion} \\
p_0\:\text{is the population proportion}\\
n\:\text{is the sample size}
\end{cases}\]null distribution: Standard Normal Distribution \(z\sim N(0,1)\)
Assumes the sample size is large enough such that both \(np\) and \(n(1−p)\) are greater than 5 (where \(p\) is the population proportion).
In a survey of 400 voters, 250 say they support a particular candidate. You want to know if the proportion of voters supporting the candidate is significantly different from 0.60 (i.e., 60%).
Null hypothesis \(H_0:p=0.6\) (the population proportion of supporters is 60%)
Alternative hypothesis \(H_A:p\ne0.6\) (the population proportion of supporters is not 60%)
z <- ((250/400) - 0.6)/sqrt(0.6*(1 - 0.6)/400)
pnorm(z, 0, 1, lower.tail = TRUE)  # left 0.85 so the value lies to the right of the distribution
[1] 0.8462829
pnorm(z,0,1,lower.tail =FALSE) # right 0.15 so this is ok
[1] 0.1537171
p <- 2*pnorm(z, 0, 1, lower.tail = FALSE)
print(p)
[1] 0.3074342
The p-value \(p = 0.31\) far exceeds the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis; there is no significant evidence that the proportion of voters supporting the candidate differs from the expected 60%.
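The built-in `prop.test` (without continuity correction) runs the equivalent test; the X-squared it reports is the square of the z computed above:

```r
# equivalent built-in test: X-squared here equals z^2 from the manual calculation
prop.test(x = 250, n = 400, p = 0.6, correct = FALSE)
```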
Two-Sample Z-test for Proportions Compares the proportions from two independent samples.
test statistic: \[z=\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}\quad\text{where}\:\begin{cases}
\hat{p}_1,\hat{p}_2\:\text{are the sample proportions} \\
\hat{p}\:\text{is the combined proportion of successes in both samples} \\
n_1,n_2\:\text{are the sample sizes}
\end{cases}\]null distribution: Standard Normal Distribution \(z\sim N(0,1)\)
Assumes both sample sizes meet the normal approximation, i.e. each \(n_i\) is large enough that both \(n_i p\) and \(n_i(1−p)\) are greater than 5 (where \(p\) is the population proportion).
You want to compare the proportions of male and female voters supporting a particular candidate. In a sample of 200 males, 130 support the candidate, and in a sample of 300 females, 180 support the candidate.
Null hypothesis \(H_0:p_1=p_2\) (the proportion of male and female voters supporting the candidate is the same)
Alternative hypothesis \(H_A:p_1\ne p_2\) (the proportions are different)
p_pool <- (130 + 180)/(200 + 300)  # pooled proportion of successes
z <- ((130/200) - (180/300))/sqrt(p_pool*(1 - p_pool)*((1/200) + (1/300)))
pnorm(z, 0, 1, lower.tail = TRUE)  # left 0.87 so the value lies to the right of the distribution
[1] 0.8704299
pnorm(z,0,1,lower.tail =FALSE) # right 0.12 so this is ok
[1] 0.1295701
p <- 2*pnorm(z, 0, 1, lower.tail = FALSE)
print(p)
[1] 0.2591402
The p-value \(p = 0.26\) exceeds the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis; there is no significant evidence that men and women differ in their support for the candidate.
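Again, `prop.test` without continuity correction gives the equivalent result (its X-squared equals the square of the z above):

```r
# two-sample proportion test: supporters out of the sampled males and females
prop.test(x = c(130, 180), n = c(200, 300), correct = FALSE)
```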
F-Test TODO
Mann-Whitney U Test TODO - Compare medians
Mann-Whitney U Test (Wilcoxon Rank-Sum) Compare medians of two independent samples without assuming normality.
test statistic: \[U=\min(U_1,U_2)\quad\text{where}\:\begin{cases}
R_1=\text{sum of ranks for sample }A\\
U_1=R_1-\frac{n_1(n_1+1)}{2}\\
U_2=n_1n_2-U_1
\end{cases}\]
null distribution: the Mann-Whitney U distribution, which is approximately normal for large sample sizes.
Used to test whether two independent groups come from the same distribution when the assumptions of the independent two-sample t-test (normality, equal variances) are not met.
Null hypothesis \(H_0\): The distributions of both groups are equal (same median)
Alternative hypothesis \(H_A\): The distributions differ (medians are not equal)
Are the scores of two groups significantly different?
Group A: 85, 90, 88, 75
Group B: 70, 65, 80, 60
groupA <- c(85, 90, 88, 75)
groupB <- c(70, 65, 80, 60)
wilcox.test(groupA, groupB, alternative = "two.sided", exact = FALSE)  # exact = FALSE for larger samples or ties
Wilcoxon rank sum test with continuity correction
data: groupA and groupB
W = 15, p-value = 0.0606
alternative hypothesis: true location shift is not equal to 0
Categorical Data
Chi-squared test - compare sample to a distribution or categorical data from different contexts
Uses the chi-squared distribution \(\chi^2_\nu\) with degrees of freedom \(\nu\) based on sample size to determine the probability of observing a scenario at least as extreme as the one observed.
used when we’re testing categorical data (like frequencies or variances) ~ really a sum of squared standard normal variables (e.g. testing if observed frequencies differ from expected)
testing if variance of a normal distribution differs from a known value
ALWAYS right skewed since chi squared test only measures total deviation, not direction!!
Assumes errors are normally distributed when used for variance testing
Chi-squared Test of Independence test whether two categorical variables are independent of each other.
test statistic: \[\chi^2=\sum\frac{(O-E)^2}{E}\quad\text{where}\:\begin{cases}
O\:\text{observed value} \\
E\:\text{expected value}\\
\quad E=\frac{\text{row total}\times\text{column total}}{\text{grand total}}\\
\chi^2\:\text{statistic is used to calculate the p-value from the} \\
\quad \text{cumulative upper-tail chi-squared distribution where...}\\
\quad \text{degrees of freedom }df=(\text{No. Rows}−1)\cdot(\text{No. Columns}−1)
\end{cases}\]
null distribution: chi-squared distribution \(\chi^2_\nu\) (pchisq)
Is there an association between education level and job sector?
| observed: education \ sector | Tech | Non-Tech | Total |
|---|---|---|---|
| Bachelor’s | 30 | 20 | 50 |
| Master’s | 40 | 10 | 50 |
| Total | 70 | 30 | 100 |
Null hypothesis \(H_0:\) Education level and job sector are independent
Alternative hypothesis \(H_A:\) They are dependent
observed <- matrix(c(30, 20, 40, 10), nrow = 2, byrow = TRUE)
expected <- matrix(c(0.7*50, 0.3*50, 0.7*50, 0.3*50), nrow = 2, byrow = TRUE)
chisq.stat <- sum((observed - expected)^2/expected)
# ALWAYS right skewed since chi squared test only measures total deviation, not direction!!
p <- pchisq(chisq.stat, df = 1, lower.tail = FALSE)  # right 0.03 so this is low
print(p)
[1] 0.02909633
Alternatively…
observed <- matrix(c(30, 20, 40, 10), nrow = 2, byrow = TRUE)
dimnames(observed) <- list(Education = c("Bachelors", "Masters"),
                           Sector = c("Tech", "Non-Tech"))
chisq.test(observed, correct = FALSE)  # i should run with Yates' correction i guess
The p-value \(p \approx 0.03\) is below the significance level \(\alpha = 0.05\), so we reject the null hypothesis; there is significant evidence of a dependence between the two variables.
Chi-squared Goodness-of-Fit Test test if an observed distribution matches an expected (theoretical) distribution.
test statistic: \[\chi^2=\sum\frac{(O_i-E_i)^2}{E_i}\quad\text{where}\:\begin{cases}
O_i\:\text{observed value for category }i \\
E_i\:\text{expected value for category }i \text{ following the expected distribution} \\
\chi^2\:\text{statistic is used to calculate the p-value from the} \\
\quad \text{cumulative upper-tail chi-squared distribution where...} \\
\quad \text{degrees of freedom }df=k-1\text{ (}k\text{ categories)}
\end{cases}\]
null distribution: chi-squared distribution \(\chi^2_\nu\) (pchisq)
Are Rowentree’s Fruitpastels packed in equal colours?
Observed: Red = 15, Blue = 25, Green = 20
Null hypothesis \(H_0:\) They follow a uniform distribution
Alternative hypothesis \(H_A:\) They don’t follow a uniform distribution
observed <- c(15, 25, 20)
expected <- rep(20, 3)
chisq.stat <- sum((observed - expected)^2/expected)
# ALWAYS right skewed since chi squared test only measures total deviation, not direction!!
p <- pchisq(chisq.stat, df = 2, lower.tail = FALSE)  # right 0.29
print(p)
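The output shown below appears to come from the equivalent built-in call, which uses equal expected probabilities by default (a reconstruction; the original chunk isn’t reproduced here):

```r
chisq.test(observed)  # goodness-of-fit against equal probabilities by default
```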
Chi-squared test for given probabilities
data: observed
X-squared = 2.5, df = 2, p-value = 0.2865
The p-value \(p = 0.29\) is well above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis - there is no significant evidence that the Fruitpastels colours deviate from a uniform distribution.
Super Mario Piranha Plant Example
Does the Super Mario Piranha Plant use the hypothesised gear mechanism?
Null hypothesis \(H_0:\) They follow a uniform distribution with \(U(1,13)\)
Alternative hypothesis \(H_A:\) They don’t follow a uniform distribution
barplot(table(observed), main = "Frequency of Values", xlab = "Values",
        ylab = "Frequency", col = "lightblue", border = "darkblue")
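The `observed` vector used by the barplot above isn’t reproduced in this note; as a sketch with simulated stand-in data, the test itself is a goodness-of-fit against a discrete uniform on 1–13:

```r
# hypothetical stand-in data: replace with the actual recorded values
set.seed(1)
observed <- sample(1:13, size = 200, replace = TRUE)
# goodness-of-fit against a discrete uniform distribution on 1..13
chisq.test(table(factor(observed, levels = 1:13)))
```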
Chi-squared Test for Homogeneity (test of independence across two or more populations) Test if different populations have the same distribution of a categorical variable.
test statistic: \[\chi^2=\sum\frac{(O-E)^2}{E}\quad\text{where}\:\begin{cases}
O\:\text{observed value} \\
E\:\text{expected value}\\
\quad E=\frac{\text{row total}\times\text{column total}}{\text{grand total}}\\
\chi^2\:\text{statistic is used to calculate the p-value from the} \\
\quad \text{cumulative upper-tail chi-squared distribution where...}\\
\quad \text{degrees of freedom }df=(\text{No. Rows}−1)\cdot(\text{No. Columns}−1)
\end{cases}\]
null distribution: chi-squared distribution \(\chi^2_\nu\) (pchisq)
Are product preferences the same in 3 cities?
| Observed: Product \ City | Inverness | Dundee | Aberdeen |
|---|---|---|---|
| Apple | 30 | 40 | 35 |
| Samsung | 20 | 25 | 30 |
Null hypothesis \(H_0:\) Preferences are the same across all cities
Alternative hypothesis \(H_A:\) Preferences differ by city
observed <- matrix(c(30, 40, 35, 20, 25, 30), nrow = 2, byrow = TRUE)
expected <- matrix(c(105*50/180, 105*65/180, 105*65/180,
                     75*50/180,  75*65/180,  75*65/180), nrow = 2, byrow = TRUE)
chisq.stat <- sum((observed - expected)^2/expected)
# ALWAYS right skewed since chi squared test only measures total deviation, not direction!!
p <- pchisq(chisq.stat, df = 2, lower.tail = FALSE)  # right 0.65 so this is high
print(p)
[1] 0.647158
Alternatively…
observed <- matrix(c(30, 40, 35, 20, 25, 30), nrow = 2, byrow = TRUE)
dimnames(observed) <- list(Product = c("Apple", "Samsung"),
                           City = c("Inverness", "Dundee", "Aberdeen"))
chisq.test(observed, correct = FALSE)  # i should run with Yates' correction i guess
The p-value \(p = 0.65\) is high and far above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis - there is no significant evidence that product preference varies between the cities.
Comparing Variances
F-test - Compare two variances
often used before t-tests but it really does fuck all - only catching variances that are seriously out - use a Welch t-test and grow up.
test statistic: \[F=\frac{s_1^2}{s_2^2}\quad\text{where}\:\begin{cases}
s_1,s_2\:\text{sample std.dev} \\
df_1=n_1-1:\text{degrees of freedom for sample size }n_1\\
df_2=n_2-1:\text{degrees of freedom for sample size }n_2\\
\end{cases}\]
null distribution: F-distribution \(F_{df_1,df_2}\) (pf)
Two machines produce metal rods. We want to test whether the variability in lengths (i.e., variances) differs between the two machines.
var.test(x =rnorm(15, sd =4.2), y =rnorm(12, sd =3.1))
F test to compare two variances
data: rnorm(15, sd = 4.2) and rnorm(12, sd = 3.1)
F = 2.6988, num df = 14, denom df = 11, p-value = 0.1048
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.803501 8.351701
sample estimates:
ratio of variances
2.698807
The p-value \(p = 0.10\) is above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis - there is no significant evidence that the population variances differ.
TODO: - Levene’s test Compare variances without assuming normality (more robust than F) - Bartlett’s test Like Levene, but assumes normality — for 2+ groups
Correlation & Regression
Pearson correlation test - test linear relationship between two continuous variables
dig out the C++ implementation I used to validate dysis camera calibration data in the sys_info validation step
test statistic: \[t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\quad\text{where}\:\begin{cases}
r\:\text{Pearson correlation coefficient}\\
n\:\text{is the sample size}\\
\text{the t-statistic is compared against t-distribution with }n−2\text{ dof}
\end{cases}\]
Null hypothesis \(H_0:\rho=0\) (no linear correlation)
Alternative hypothesis \(H_A:\rho\ne0\) (a non-zero linear correlation)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
cor.test(x, y, method = "pearson")
Pearson's product-moment correlation
data: x and y
t = 2.1213, df = 3, p-value = 0.124
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3400820 0.9842358
sample estimates:
cor
0.7745967
Spearman correlation test - test monotonic non-linear relationship for ranked data
Rather than working with raw values, it converts the data to ranks and then computes the Pearson correlation of the ranks.
Spearman correlation coefficient (for data with no ties; the general formula with ties is more complex): \[\rho_s=1-\frac{6\sum d_i^2}{n(n^2-1)}\quad\text{where}\:\begin{cases}
d_i\:\text{is the difference between the ranks of each pair} \\
n\:\text{number of pairs}
\end{cases}\]
test statistic: \[t=\frac{\rho_s\sqrt{n-2}}{\sqrt{1-\rho_s^2}}\quad\text{where}\:\begin{cases}
\rho_s\:\text{is the Spearman rank correlation coefficient}\\
n\:\text{is the sample size}\\
\text{the t-statistic is compared against t-distribution with }n−2\text{ dof}
\end{cases}\]
Null hypothesis \(H_0:\rho_s=0\) (no monotonic relationship)
Alternative hypothesis \(H_A:\rho_s\ne0\) (there is a monotonic relationship)
x <- c(10, 20, 30, 40, 50)
y <- c(1, 2, 3, 6, 5)
cor.test(x, y, method = "spearman")
Spearman's rank correlation rho
data: x and y
S = 2, p-value = 0.08333
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9
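As a quick sanity check, Spearman’s \(\rho_s\) is just the Pearson correlation of the ranks (no ties in this data):

```r
# Pearson correlation of the ranks reproduces rho = 0.9
cor(rank(x), rank(y))
```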
Linear regression - determine if a linear relationship exists
slope \(\beta_1\): \[y=\beta_0+\beta_1 x+\epsilon\quad\text{where}\:\begin{cases}
\beta_0\:\text{is the y-intercept} \\
\beta_1\:\text{is the gradient} \\
\epsilon\:\text{is the random error}
\end{cases}\]
test statistic: \[t=\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\quad\text{where}\:\begin{cases}
\hat{\beta}_1\:\text{is the estimated slope for the sample}\\
SE(\hat{\beta}_1)\:\text{is the standard error of the slope}\\
\text{the t-statistic is compared against t-distribution with }n−2\text{ dof}
\end{cases}\]
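The summary output below appears to come from fitting the same `x` and `y` used in the Pearson example above (a sketch of the call, since the original chunk isn’t shown):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
summary(lm(y ~ x))   # t-test on the slope; its p-value matches the correlation test
```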
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5
-0.8 0.6 1.0 -0.6 -0.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2000 0.9381 2.345 0.101
x 0.6000 0.2828 2.121 0.124
Residual standard error: 0.8944 on 3 degrees of freedom
Multiple R-squared: 0.6, Adjusted R-squared: 0.4667
F-statistic: 4.5 on 1 and 3 DF, p-value: 0.124
The p-value \(p = 0.12\) is above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis - there is no significant evidence of a linear relationship.
One Way ANOVA (Analysis of Variance) - Compare means of 3 or more groups to check if any come from different populations
test statistic: \[F=\frac{\text{between group variability}}{\text{within group variability}}=\frac{MS_{between}}{MS_{within}}\quad\text{where}\:\begin{cases}
MS_{between}=\frac{SS_{between}}{df_{between}}=\frac{\text{sum of squares}}{\text{degrees of freedom}} \\
df_{between}=k-1\\
\\
MS_{within}=\frac{SS_{within}}{df_{within}}=\frac{\text{sum of squares}}{\text{degrees of freedom}} \\
df_{within}=N-k\\
\\
k\:\text{is the number of groups} \\
N\:\text{is the total number of observations}
\end{cases}\]
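The ANOVA table below appears to come from a call of this form (a sketch; `scores` and `group` are the author’s data, which isn’t reproduced here):

```r
# one-way ANOVA: scores is a numeric response, group is a 3-level factor (12 observations)
summary(aov(scores ~ group))
```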
Df Sum Sq Mean Sq F value Pr(>F)
group 2 66.70 33.35 3.651 0.069 .
Residuals 9 82.22 9.14
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value \(p = 0.069\) is above the significance level \(\alpha = 0.05\), so we fail to reject the null hypothesis at the 5% level - the evidence that any group mean differs is only suggestive. Had the test been significant, a Tukey HSD test would show which means differ (see the sketch after the boxplot below).
boxplot(scores ~ group, col = "skyblue", main = "Group Comparison via Boxplot",
        xlab = "Group", ylab = "Scores")
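If the ANOVA had rejected \(H_0\), a Tukey HSD post-hoc comparison on the same fit (a sketch, using the same assumed `scores` and `group` objects) would show which pairs of means differ while controlling the family-wise error rate:

```r
fit <- aov(scores ~ group)
TukeyHSD(fit)   # pairwise group comparisons with adjusted p-values
```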
TODO: - Kruskal-Wallis test Non-parametric version of ANOVA
Hypothesis Dredging Avoidance
The Bonferroni correction is a method used to control the family-wise error rate (FWER) when performing multiple hypothesis tests. Given \(m\) independent tests and a desired overall significance level \(\alpha\), the corrected significance level for each individual test is:
\[
\alpha' = \frac{\alpha}{m}
\]
A hypothesis test is considered statistically significant only if its p-value satisfies \(p < \alpha'\). This approach is straightforward and conservative, reducing the likelihood of Type I errors, but potentially increasing the risk of Type II errors when the number of tests is large.
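A minimal sketch with made-up p-values, showing both the manual correction and R’s `p.adjust`:

```r
pvals <- c(0.001, 0.012, 0.030, 0.200)         # hypothetical raw p-values from m = 4 tests
alpha <- 0.05
pvals < alpha / length(pvals)                  # manual Bonferroni: compare to alpha/m
p.adjust(pvals, method = "bonferroni") < alpha # equivalent: adjust p-values, compare to alpha
```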
Starting with the definition of covariance and using the linearity of expectation\[
\begin{align}
\operatorname{Cov}(X,Y)\equiv&\mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right]\\
=&\mathbb{E}\left[XY - X\mathbb{E}[Y] - \mathbb{E}[X]Y + \mathbb{E}[X]\mathbb{E}[Y]\right]\\
=&\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[X]\mathbb{E}[Y] + \mathbb{E}[X]\mathbb{E}[Y]\qquad\text{(linearity of expectation)}\\
=&\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]
\end{align}
\]
Welch test statistic:
The Welch t-test statistic is given by: \[\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \sim \mathcal{t}_{\nu}\qquad\text{where}\quad\nu\quad\text{is the (generally fractional) Welch–Satterthwaite degrees of freedom derived below}\]
\[
\begin{align}
&\text{data drawn from two }\textbf{populations}\text{ with matching mean (under }H_0\text{) but differing variances}:\\
&\qquad X_1\sim \mathcal{N}\left(\mu,\sigma_1^2\right)
\quad\text{and}\quad X_2\sim \mathcal{N}\left(\mu,\sigma_2^2\right)\\
&\qquad\bar{X}_1\sim \mathcal{N}\left(\mu,\sigma_1^2/n_1\right)\quad\text{and}\quad\bar{X}_2\sim \mathcal{N}\left(\mu,\sigma_2^2/n_2\right)\quad\text{(from variance of expectation)}\\
&\qquad\frac{\bar{X}_1-\hat{\mu}}{\hat{\sigma}_1/\sqrt{n_1}}\sim \mathcal{t}_{n_1-1}
\quad\text{and}\quad\frac{\bar{X}_2-\hat{\mu}}{\hat{\sigma}_2/\sqrt{n_2}}\sim \mathcal{t}_{n_2-1}\qquad\text{(standardised sampling distributions)}\\
\end{align}
\]
\[
\begin{align}
&\text{sampling distribution of the mean difference estimator has a similar form,}\\
&\qquad\mathbb{E}[\bar{X}_1-\bar{X}_2]=\mathbb{E}[\bar{X}_1]-\mathbb{E}[\bar{X}_2]=0\qquad\text{(linearity of expectation)}\\
&\qquad\operatorname{Var}[\bar{X}_1-\bar{X}_2]=\operatorname{Var}[\bar{X}_1]+\operatorname{Var}[\bar{X}_2]=\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\qquad\text{(variance of independent sum)}\\
&\quad\therefore\quad\bar{X}_1-\bar{X}_2\sim\mathcal{N}\left(0,\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\right)\qquad\text{(sampling distribution, known variances)}\\
&\text{the two variance estimates are coupled - so in estimating both we }\textbf{lose a fractional dof}\\
&\quad\therefore\quad\boxed{\frac{\bar{X}_1-\bar{X}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\sim\mathcal{t}_{\nu}}\qquad\text{(Welch t-test statistic)}
\end{align}
\]
degrees of freedom:
\[
\begin{align}
&\text{test statistic is a function of sampling distributions for sample variance of the two groups...}\\
&\qquad\boxed{
\begin{aligned}
&\text{given i.i.d. data}\quad X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\quad:\\
&\;s^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2\quad\text{(sample variance)}\\
&\;\frac{(n-1)s^2}{\sigma^2}\sim\chi^2_{n-1}\quad\text{(sample variance sampling distribution)}
\end{aligned}
}\\\\
&\quad t_\nu\sim\frac{(\bar{X}-\bar{Y})-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\qquad\text{(standardised test statistic)}
\\\\
&\quad\text{where}\quad s_1^2\sim\frac{\sigma_1^2}{n_1-1}\chi^2_{n_1-1}\qquad\text{and}\qquad s_2^2\sim\frac{\sigma_2^2}{n_2-1}\chi^2_{n_2-1}
\end{align}
\] using:
\[
\begin{align}
\text{let's now look at the}\quad&\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\quad\text{term, noting}\quad\mathbb{E}[\chi^2_\nu]=\nu\quad\text{and}\quad\operatorname{Var}(\chi^2_\nu)=2\nu:\\
\operatorname{Var}\left(\frac{s_1^2}{n_1}\right)&=\operatorname{Var} \left( \frac{\sigma_1^2}{n_1} \cdot \frac{\chi^2_{n_1 - 1}}{n_1 - 1} \right)\\
&=\underbrace{\left(\frac{\sigma_1^2}{n_1}\cdot\frac{1}{n_1-1}\right)^2}_{\text{scaling of variance}}\cdot\operatorname{Var}(\chi^2_{n_1-1})\\
&=\frac{\sigma_1^4}{n_1^2}\cdot\frac{1}{(n_1-1)^2}\cdot2\:\cdot(n_1-1)\\\\
\text{so}\quad\operatorname{Var}\left(\frac{s_1^2}{n_1}\right)=\frac{\sigma_1^4}{n_1^2}&\cdot\frac{2}{n_1-1}\qquad\text{and}\qquad\operatorname{Var}\left(\frac{s_2^2}{n_2}\right)=\frac{\sigma_2^4}{n_2^2}\cdot\frac{2}{n_2-1}\\\\
\therefore\qquad\operatorname{Var}\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)&\quad=\quad\frac{\sigma_1^4}{n_1^2}\cdot\frac{2}{n_1-1}\quad+\quad\frac{\sigma_2^4}{n_2^2}\cdot\frac{2}{n_2-1}
\end{align}
\]\[
\begin{align}
&\text{remember the Student's t-distribution is a function of the z and chi-squared distribution...}\\
&\qquad\qquad\boxed{t_\nu\sim\frac{Z}{\sqrt{V/\nu}}\qquad\text{where}\quad Z\sim N(0,1)\quad\text{and}\quad V\sim\chi^2_\nu}\\
&\text{we previously calculated the test statistic}\\
&\qquad\qquad\boxed{\frac{\bar{X}_1-\bar{X}_2}{\sqrt{\frac{\hat{\sigma}_1^2}{n_1}+\frac{\hat{\sigma}_2^2}{n_2}}}\sim\mathcal{t}_{\nu}}\qquad\text{(Welch test statistic)}\\
&\text{under }H_0\text{ the means follows a normal distribution with zero mean}\\
&\:\text{notice numerator uses population parameters and denominator uses estimates...}\\
&\qquad\qquad\frac{N(0,\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2})}{\sqrt{\frac{\hat{\sigma}_1^2}{n_1}+\frac{\hat{\sigma}_2^2}{n_2}}}\sim\mathcal{t}_{\nu}\quad\text{or}\quad\frac{Z\cdot\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}{\sqrt{\frac{\hat{\sigma}_1^2}{n_1}+\frac{\hat{\sigma}_2^2}{n_2}}}=\frac{Z}{\sqrt{\hat{\sigma}^2/\sigma^2}}\sim\mathcal{t}_{\nu}\\
&\text{now we can find the degrees of freedom solving for}\:\nu:\\
&\qquad\qquad\frac{\hat{\sigma}_1^2}{n_1}+\frac{\hat{\sigma}_2^2}{n_2}\approx\frac{\sigma^2}{\nu}\chi^2_\nu\qquad\text{Satterthwaite approximation}\\
&\nu\text{ is found by matching the first and second moments of both sides!}
\end{align}
\]
welch satterthwaite approximation:
We want to estimate the degrees of freedom for a statistic of the form:
\[
V = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}
\]
where each \(s_i^2\) is the sample variance from a normal distribution with unknown population variance \(\sigma_i^2\).
We approximate \(V\) by a scaled chi-squared distribution:
\[
V \approx \frac{\sigma^2}{\nu} \chi^2_\nu
\qquad \text{(Satterthwaite approximation)}
\]
Step 1: Match Moments
The goal is to approximate the distribution of \(V\) using a scaled chi-squared variable. For a chi-squared distribution:
\(\mathbb{E}[\chi^2_\nu] = \nu\)
\(\text{Var}(\chi^2_\nu) = 2\nu\)
So for \(V \approx \frac{\sigma^2}{\nu} \chi^2_\nu\), we have:
\(\mathbb{E}[V] = \sigma^2\)
\(\text{Var}(V) = \frac{2\sigma^4}{\nu}\)
Step 2: Compute Moments from Components
Since \(V = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\), we compute:
\[
\begin{align}
&\qquad\mathbb{E}[V]=\mathbb{E}\left[\frac{s_1^2}{n_1}\right]+\mathbb{E}\left[\frac{s_2^2}{n_2}\right]=\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\qquad\text{(sample variances are unbiased)}\\
&\qquad\operatorname{Var}(V)=\operatorname{Var}\left(\frac{s_1^2}{n_1}\right)+\operatorname{Var}\left(\frac{s_2^2}{n_2}\right)=\frac{2\sigma_1^4}{n_1^2(n_1-1)}+\frac{2\sigma_2^4}{n_2^2(n_2-1)}\qquad\text{(computed above)}\\\\
&\text{equating this with }\operatorname{Var}(V)=\frac{2\sigma^4}{\nu}\text{, where }\sigma^2=\mathbb{E}[V]\text{, and solving for }\nu:\\
&\qquad\nu=\frac{\left(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\right)^2}{\frac{\sigma_1^4}{n_1^2(n_1-1)}+\frac{\sigma_2^4}{n_2^2(n_2-1)}}\\\\
&\text{replacing the unknown }\sigma_i^2\text{ with the sample variances gives the Welch-Satterthwaite degrees of freedom:}\\
&\qquad\boxed{\nu\approx\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)}+\frac{s_2^4}{n_2^2(n_2-1)}}}
\end{align}
\]
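A quick numeric check of this formula against R’s own Welch calculation (made-up normal samples; `t.test` reports the Welch–Satterthwaite df as `parameter`):

```r
set.seed(42)
x <- rnorm(15, sd = 4.2)
y <- rnorm(12, sd = 3.1)
v1 <- var(x)/length(x); v2 <- var(y)/length(y)
nu <- (v1 + v2)^2 / (v1^2/(length(x) - 1) + v2^2/(length(y) - 1))
c(manual = nu, t.test = unname(t.test(x, y)$parameter))  # the two should agree
```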
Footnotes
Ronald A. Fisher introduced the idea of hypothesis testing in the early 20th century, using the term “significance” to describe whether an observed result was strong enough to warrant rejecting a null hypothesis. The significance level (\(\alpha = 5\%\)) sets the threshold for how much evidence is needed before rejecting the null hypothesis. A high significance level corresponds to a low threshold for rejecting the null hypothesis and vice versa.↩︎
Power is the ability of a test to detect a true effect (\(\approx 80\%\)) - the opposite of a Type II error (\(\beta\)), which is failing to detect an effect when one exists.↩︎