What is Statistical Inference?

  • What is the “true” distribution of a relevant variable, given observations?
  • What is the true joint distribution of several variables, given observations?
  • Drawing conclusions about populations based on sample data …
  • Main approaches:
    1. Estimation: Determining likely values of population parameters
      Example: Estimating the average blood pressure of adults in a city based on a sample of 1000 residents.
    2. Hypothesis Testing: Assessing claims about populations
      Example: Determining if a new drug significantly reduces cholesterol levels compared to a placebo
    3. Forecasting (prognosis): What will the future values of relevant medical parameters be?
      Example: Predicting the number of flu cases in the next winter season based on historical data and current trends.

Estimation

Point Estimation

  • Point Estimator: A single value that serves as a “best guess” of a population parameter
  • Example: Assume that systolic blood pressure is normally distributed. What are good point estimators \(\hat{\mu}, \hat{\sigma}\) for the expectation \(\mu\) and the standard deviation \(\sigma\)?
  • Typical properties of reasonable estimators:
    1. Unbiasedness: \(E(\hat{\theta}) = \theta\)
    2. Efficiency: Smallest variance among unbiased estimators
    3. Consistency: \(\text{plim}_{n \to \infty} \hat{\theta}=\theta\)
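
Unbiasedness can be checked by simulation. The following is a minimal sketch, assuming normal data with \(\sigma^2 = 4\) and \(n = 10\) (values chosen only for illustration): the variance estimator with divisor \(n-1\) averages close to \(\sigma^2\), while the version with divisor \(n\) is too small by the factor \((n-1)/n\).

```r
# Simulation sketch: bias of two variance estimators (illustrative settings)
set.seed(1)
n      <- 10      # sample size (assumed)
sigma2 <- 4       # true variance (assumed)
reps   <- 20000   # number of simulated samples

est <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(unbiased = var(x),                 # divisor n - 1
    biased   = var(x) * (n - 1) / n)   # divisor n
})

mean_est <- rowMeans(est)
mean_est   # first entry near 4, second near 4 * 9 / 10 = 3.6
```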

Common Point Estimators

  1. Binomial distribution: sample proportion \(\hat{p}\) for the population proportion \(p\): \(\hat{p} = \frac{x}{n}\), where \(x\) is the number of successes in \(n\) trials

  2. Poisson distribution: estimated cases per period \(\hat{\lambda}\) for the cases per period in the population \(\lambda\): \(\hat{\lambda} = \frac{\sum_{i=1}^n x_i}{n}\), where \(x_i\) are the counted cases in \(n\) consecutive time periods.

  3. Normal distribution:

    • Sample Mean (\(\bar{X}\)) for population mean (\(\mu\)): \(\bar{X} = \frac{1}{n}\sum_{i=1}^n x_i\)

    • Sample Variance (\(s^2\)) for population variance (\(\sigma^2\)): \(s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{X})^2\)

Interval Estimate: Confidence Interval

  • An interval \(CI=[v_l,v_u]\) “likely” to contain the population parameter \(\theta\)
  • Trustworthiness is specified by the confidence level \(\gamma=1-\alpha\), e.g., 0.99
  • Common misreading: “the true parameter lies in the computed CI with probability \(\gamma\)”. Once the interval has been computed, \(\theta\) is either inside it or not.
  • Correct: the interval bounds are random. If sampling and CI calculation are repeated many times, about \(100\cdot\gamma\%\) of the resulting intervals contain \(\theta\): \[P(v_l(X)\leq \theta \leq v_u(X))=\gamma\]
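
The repeated-sampling interpretation can be illustrated by simulation. This is a sketch under assumed values (true mean 120, standard deviation 15, \(n = 25\)); it uses R's t.test() to compute a 95% interval for each simulated sample and counts how often the true mean is covered.

```r
# Simulation sketch: how often does a 95% interval cover the true mean?
set.seed(2)
mu    <- 120    # true mean (assumed for illustration)
sigma <- 15     # true standard deviation (assumed)
n     <- 25
reps  <- 5000

covered <- replicate(reps, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

coverage <- mean(covered)
coverage   # should be close to 0.95
```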

Confidence Intervals for expectation \(\mu\)

  • For a sample \(X_i\) i.i.d. \(N(\mu,\sigma^2)\) with unknown \(\sigma\), the interval bounds at confidence level \(1-\alpha\) are: \[v=\bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}\]
  • Where:
    • \(\bar{x}\) is the sample mean
    • \(t_{\alpha/2, n-1}\) is the upper \(\alpha/2\) quantile (i.e., the \(1-\alpha/2\) quantile) of the t-distribution with \(n-1\) degrees of freedom
    • \(s\) is the sample standard deviation
    • \(n\) is the sample size

CI for expectation in R

# Load a medical dataset
library(MASS)
data(birthwt)

x <- birthwt$bwt
n <- length(x)

# 95% CI by hand: mean -/+ t-quantile * standard error
mean(x) + c(-1, 1) * qt(1 - 0.05/2, n - 1) * sd(x) / sqrt(n)
[1] 2839.952 3049.222
# alternative
ci_result <- t.test(birthwt$bwt, conf.level = 0.95)
ci_result$conf.int
[1] 2839.952 3049.222
attr(,"conf.level")
[1] 0.95

Confidence Intervals: Proportions

  • Sample \(X_i\) i.i.d. Bernoulli with parameter \(p\) and sample size \(n\)
  • CI for parameter \(p\), if large sample size \(n\): \[v=\hat{p}\pm z_{\alpha/2}\cdot\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
  • Where:
    • \(\hat{p}\) is the relative frequency in the sample
    • \(z_{\alpha/2}\) is the upper \(\alpha/2\) quantile (i.e., the \(1-\alpha/2\) quantile) of the standard normal distribution for the desired confidence level \(\gamma=1-\alpha\)
    • \(n\) is the sample size

CI for Proportion in R

# Calculate proportion of low birth weight babies
p <- mean(birthwt$low)
p
[1] 0.3121693
c(p + qnorm(0.05/2)*sqrt(p*(1-p)/n),p - qnorm(0.05/2)*sqrt(p*(1-p)/n))
[1] 0.2461071 0.3782315
# alternative: Wilson score interval with continuity correction (prop.test)
prop_ci <- prop.test(sum(birthwt$low), nrow(birthwt), conf.level = 0.95)

# Display results
round(prop_ci$conf.int, 3)
[1] 0.248 0.384
attr(,"conf.level")
[1] 0.95

Bootstrapping

  • Resampling technique for estimating the sampling distribution of statistics
  • Useful when
    • the distribution is unknown
    • no standard CI formula is available
    • the sample size is small
  • Resampling
    1. Draw a resample of size \(n\) with replacement from the original data of size \(n\)
    2. Calculate the statistic of interest for this resample
    3. Repeat steps 1-2 \(M\) times (typically \(M\geq 1000\))
    4. Use the simulated distribution of the statistics to estimate CI by quantiles
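
The four resampling steps can be written in base R without extra packages. This is a minimal sketch on hypothetical skewed data (an exponential sample, chosen only for illustration), using the median as the statistic of interest.

```r
# Percentile bootstrap for the median, base R only (illustrative data)
set.seed(3)
x <- rexp(50, rate = 1 / 10)   # hypothetical skewed sample, n = 50
M <- 2000                      # number of bootstrap resamples

boot_med <- replicate(M, {
  resample <- sample(x, size = length(x), replace = TRUE)  # step 1
  median(resample)                                         # step 2
})                                                         # step 3: M repeats

ci <- quantile(boot_med, probs = c(0.025, 0.975))          # step 4
ci
```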

Bootstrapping

  • For a statistic \(\hat{\theta}\), computed on each of \(M\) bootstrap samples:

  • Bootstrap estimate:

\[\hat{\theta}^* = \frac{1}{M} \sum_{i=1}^M \hat{\theta}_i\]

  • Bootstrap standard error:

\[SE(\hat{\theta}^*) = \sqrt{\frac{1}{M-1} \sum_{i=1}^M (\hat{\theta}_i - \hat{\theta}^*)^2}\]

Where \(\hat{\theta}_i\) is the statistic calculated from the \(i\)-th bootstrap sample and \(M\) is the number of bootstrap samples
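
As a sketch of these two quantities, the code below computes the bootstrap estimate and the bootstrap standard error of the sample mean over \(M\) resamples and compares the latter with the textbook formula \(s/\sqrt{n}\). The data are hypothetical and only serve as an illustration.

```r
# Bootstrap estimate and standard error of the sample mean (base R sketch)
set.seed(4)
x <- rnorm(40, mean = 50, sd = 8)   # hypothetical data
M <- 5000                           # number of bootstrap resamples

theta_b <- replicate(M, mean(sample(x, replace = TRUE)))

theta_star <- mean(theta_b)   # bootstrap estimate
se_boot    <- sd(theta_b)     # bootstrap standard error (divisor M - 1)
se_formula <- sd(x) / sqrt(length(x))

c(theta_star = theta_star, se_boot = se_boot, se_formula = se_formula)
```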

Bootstrapping in R

library(mosaic)
# mean
simMean <- do(10000) * mean(~bwt,data=resample(birthwt)) 
# bootstrap estimate for the mean
mean(simMean$mean)
[1] 2945.304
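# bootstrap percentile interval; note that probs = c(0.025, 0.9975) cuts off
# 2.5% below but only 0.25% above, a symmetric 95% interval needs c(0.025, 0.975)
quantile(simMean$mean,probs=c(0.025,0.9975))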
    2.5%   99.75% 
2839.368 3088.826 
# proportion (percentile interval: 2.5% cut off below, only 0.25% above)
simProp <- do(10000) *  mean(resample(birthwt$low == 1)) 
quantile(simProp$mean,probs=c(0.025,0.9975))
     2.5%    99.75% 
0.2486772 0.4126984 
# median
simMed <- do(10000) * median(~bwt,data=resample(birthwt)) 
# bootstrap estimate for the median
mean(simMed$median)
[1] 2976.418
# bootstrap percentile interval (2.5% cut off below, only 0.25% above)
quantile(simMed$median,probs=c(0.025,0.9975))
  2.5% 99.75% 
  2835   3147 

Main Paradigms of Estimation

  1. Maximum Likelihood Estimation (MLE)
    • Chooses parameters that maximize the likelihood of observing the data
    • Widely used, asymptotically efficient
  2. Method of Moments (MoM)
    • Equates sample moments with theoretical moments
    • Simple, but the estimators generally do not share the good properties of MLE
  3. Bayesian Estimation
    • Incorporates prior beliefs about parameters
    • Provides full posterior distribution of parameters
    • Good approach for updating beliefs over time

Maximum Likelihood Estimation

Idea: Choose parameters that maximize the probability of observing the data

Likelihood function: \[L(\theta|x) = f(x|\theta) = \prod_i f(x_i|\theta)\]

Log-likelihood: \[\ell(\theta|x) = \log L(\theta|x) = \sum_i \log f(x_i|\theta)\]

MLE estimator: \[\hat{\theta}_{MLE} = \arg\max_{\theta} \ell(\theta|x)\]
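
A minimal sketch of these definitions, assuming Poisson-distributed counts (simulated here for illustration): the log-likelihood is maximized numerically with optimize(), and the result is compared with the closed-form MLE, the sample mean.

```r
# MLE sketch: Poisson rate via numerical maximization of the log-likelihood
set.seed(5)
x <- rpois(200, lambda = 3.2)   # hypothetical counts

# log-likelihood: sum of log densities
loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))

fit <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE)
lambda_hat <- fit$maximum

c(numerical = lambda_hat, closed_form = mean(x))
```

Working with the log-likelihood rather than the likelihood avoids numerical underflow when many small density values are multiplied.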

Ronald A. Fisher (1890-1962)

  • Life and Career

    • British statistician, evolutionary biologist, and geneticist
    • Founder of modern statistics and evolutionary biology
    • Professor: UCL (1933-1943), Cambridge (1943-1957)
  • Key Contributions

    • Developed ANOVA, maximum likelihood
    • Pioneered design of experiments
    • Fundamental work in genetics and evolution; major books: “Statistical Methods for Research Workers” (1925), “The Genetical Theory of Natural Selection” (1930), “The Design of Experiments” (1935)
  • Controversies:

    • Supported (positive) eugenics

Ronald A. Fisher

Maximum Likelihood: Main Properties

  • Consistency: \[\hat{\theta}_{MLE} \xrightarrow{p} \theta_0 \text{ as } n \to \infty\]

  • Asymptotic normality: \[\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})\]

  • Asymptotic efficiency: the ML estimator attains the Cramér-Rao lower bound as \(n \to \infty\). Asymptotically, no unbiased estimator has a smaller variance.

Example: Glucose level

load("diabetes.RData")
glucLev <- diabetes$Glucose[diabetes$Glucose>0]

library(fitdistrplus)
plotdist(glucLev, histo = TRUE, demp = TRUE)

Example: Glucose level

fg=fitdist(glucLev,"gamma")
summary(fg)
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters : 
        estimate  Std. Error
shape 16.2787190 0.824061426
rate   0.1338447 0.006880465
Loglikelihood:  -3667.141   AIC:  7338.281   BIC:  7347.556 
Correlation matrix:
          shape      rate
shape 1.0000000 0.9846507
rate  0.9846507 1.0000000
fln=fitdist(glucLev,"lnorm")
summary(fln)
Fitting of the distribution ' lnorm ' by maximum likelihood 
Parameters : 
         estimate  Std. Error
meanlog 4.7703198 0.009063607
sdlog   0.2503591 0.006408478
Loglikelihood:  -3665.757   AIC:  7335.513   BIC:  7344.788 
Correlation matrix:
        meanlog sdlog
meanlog       1     0
sdlog         0     1
fw=fitdist(glucLev,"weibull")
summary(fw)
Fitting of the distribution ' weibull ' by maximum likelihood 
Parameters : 
        estimate Std. Error
shape   4.213099  0.1134178
scale 133.645611  1.2178178
Loglikelihood:  -3706.577   AIC:  7417.154   BIC:  7426.428 
Correlation matrix:
          shape     scale
shape 1.0000000 0.3322829
scale 0.3322829 1.0000000

Example: Glucose level

plot.legend <- c("Gamma", "lognormal", "Weibull")

par(mfrow=c(1,2))
denscomp(list(fg, fln, fw), legendtext = plot.legend)
qqcomp(list(fg, fln, fw), legendtext = plot.legend)

Hypothesis Testing

Hypotheses and Rejection

  • In statistics, hypotheses are statements about underlying distributions at the population level
  • In parametric statistics: hypotheses about distribution parameters
  • Content-related hypotheses have to be translated into statistical hypotheses!
    • Glucose level in patients decreases after administering some drug.
      -> Assume that glucose level \(G\) is normally distributed with expectation \(\mu_b\) before, and \(\mu_a\) after administration. Hypothesis: \(H: \;\mu_a < \mu_b\).
  • It is not possible to prove a statistical hypothesis directly.
  • Instead, we formulate two contradictory hypotheses and try to “reject” one.
  • In order to confirm a hypothesis, reject the contradicting hypothesis!
  • A hypothesis about a larger population can never be rejected with certainty from a sample. We want evidence that rejection is reasonable.

How to decide about rejection?

  • Test for a parameter \(\theta\) of the distribution of a random variable \(X\)

  • Consider simple Hypotheses: \(H_0: \theta=\theta_0\) versus \(H_1: \theta=\theta_1\)

  • Distribution of random variable \(X\):

    • Under \(H_0\) with PDF \(f(x|\theta_0)\)

    • Under \(H_1\) with PDF \(f(x|\theta_1)\)

Errors and Power

  • Data are random, hence any decision can be wrong by chance.

  • The Type I error or \(\alpha\)-error consists of incorrectly rejecting the null hypothesis. The probability that this happens should be no more than \(\alpha\), the significance level.

  • The significance level is key: tests are constructed “at level \(\alpha\)”.

  • The Type II error or \(\beta\)-error consists of failing to reject the null hypothesis when the alternative is true. The probability of this error is denoted by \(\beta\).

  • The power \(1-\beta\) of a test is the probability of correctly rejecting the null hypothesis when the alternative is true.

  • In medical contexts, often

    • power \(1-\beta\) is called sensitivity
    • \(1-\alpha\) is called specificity.
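
Both error probabilities can be estimated by simulation. The sketch below assumes a one-sided z-test of \(H_0: \mu \leq 130\) with known \(\sigma = 5\) and \(n = 20\), and a true mean of 133 for the power calculation (an arbitrary illustrative alternative).

```r
# Simulation sketch: type I error rate and power of a one-sided z-test
set.seed(6)
mu0 <- 130; sigma <- 5; n <- 20; alpha <- 0.05
reps <- 20000

reject_rate <- function(mu_true) {
  z <- replicate(reps, {
    x <- rnorm(n, mean = mu_true, sd = sigma)
    (mean(x) - mu0) / (sigma / sqrt(n))
  })
  mean(z > qnorm(1 - alpha))   # share of samples in which H0 is rejected
}

type1 <- reject_rate(mu_true = 130)   # H0 true (boundary case): near alpha
power <- reject_rate(mu_true = 133)   # H1 true: estimated power 1 - beta
c(type1 = type1, power = power)
```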

Errors and Power

                 Reality: H₀ is true                  Reality: H₁ is true
Decision for H₀  right decision (true negative),      type II error (false negative),
                 specificity \(1-\alpha\)             probability \(\beta\)
Decision for H₁  type I error (false positive),       right decision (true positive),
                 probability \(\alpha\)               sensitivity \(1-\beta\)

How to decide about rejection?

  • Neyman-Pearson test: compare the likelihoods of the observations \(x_1, \dots, x_n\) under both hypotheses \[L(\theta_i|x) = \prod_j f(x_j|\theta_i)\]

  • Given the data \(x\), reject the null hypothesis if \[T(x) = \frac{L(\theta_1|x)}{L(\theta_0|x)}>c,\]where the critical value \(c\) is the \(1-\alpha\) quantile of the distribution of \(T\) under \(H_0\).

  • Based on the same idea as maximum likelihood

  • Most powerful test at level \(\alpha\) for simple hypotheses (Neyman-Pearson lemma).

  • Generalizes to hypotheses like \(\theta\leq \theta_0\) vs. \(\theta>\theta_0\) for certain distribution families.
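
For normal data with known \(\sigma\), the likelihood ratio can be computed directly. The sketch below uses assumed values (\(\mu_0 = 130\), \(\mu_1 = 133\), \(\sigma = 5\)) and a simulated sample; it compares the log of the ratio with its closed form, which is a linear, increasing function of the sample mean.

```r
# Likelihood ratio sketch for N(mu, sigma^2) with known sigma
set.seed(7)
mu0 <- 130; mu1 <- 133; sigma <- 5       # simple hypotheses (assumed values)
x <- rnorm(20, mean = 132, sd = sigma)   # hypothetical sample

# log of T(x) = L(mu1 | x) / L(mu0 | x), computed from the densities
log_T <- sum(dnorm(x, mu1, sigma, log = TRUE)) -
         sum(dnorm(x, mu0, sigma, log = TRUE))

# closed form: an increasing linear function of the sample mean
n <- length(x)
log_T_closed <- n * (mu1 - mu0) / sigma^2 * (mean(x) - (mu0 + mu1) / 2)

c(log_T = log_T, log_T_closed = log_T_closed)
```

Because the ratio increases with the sample mean when \(\mu_1 > \mu_0\), rejecting for large values of the ratio is equivalent to rejecting for large values of \(\bar{x}\).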

Simplified view

  • Start with a test statistic \(T\), derived earlier by the Neyman-Pearson approach

  • From the data \(x\), the value of the statistic has been calculated as \(t_d=T(x)\).

  • Under \(H_0\), this statistic has a certain distribution with PDF \(f_0(t)\)

  • Consider the null hypothesis \(\theta\leq \theta_0\); then \(H_0\) is typically rejected when the value \(t_d\) is large enough.

  • “Large enough” means that \(t_d\) is larger than the \(1-\alpha\) quantile of the distribution of \(T\) under \(H_0\)

p-Values

  • In a similar manner, for the null hypothesis \(\theta\geq \theta_0\), \(H_0\) is typically rejected when the value \(t\) is small enough.

  • For the null hypothesis \(\theta= \theta_0\), \(H_0\) is typically rejected when the value \(t\) is either small enough or large enough.

  • Alternatively and equivalently, the decision can be based on the p-value:

    • For \(H_0: \theta\leq \theta_0\), calculate the probability \(p=P_0(T\geq t)\) under the null hypothesis

    • For \(H_0: \theta\geq \theta_0\), calculate the probability \(p=P_0(T\leq t)\) under the null hypothesis

    • For \(H_0: \theta = \theta_0\), calculate the probabilities \(p_1=P_0(T\geq t)\) and \(p_2=P_0(T\leq t)\) under the null hypothesis and use \(p=2\min(p_1,p_2)\).

  • If the relevant \(p\)-value is smaller than the significance level \(\alpha\), reject \(H_0\)
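
A short sketch of the three cases, assuming a test statistic that is standard normal under \(H_0\) and a hypothetical observed value of 1.79. For the two-sided case, doubling the smaller tail probability is used, which is one common convention.

```r
# p-values for an observed z statistic under H0 (standard normal null)
t_obs <- 1.79   # hypothetical observed value of the test statistic

p_upper <- 1 - pnorm(t_obs)           # H0: theta <= theta0
p_lower <- pnorm(t_obs)               # H0: theta >= theta0
p_two   <- 2 * min(p_upper, p_lower)  # H0: theta  = theta0

alpha <- 0.05
c(p_upper = p_upper, p_lower = p_lower, p_two = p_two)
c(reject_upper = p_upper < alpha, reject_two = p_two < alpha)
```

With these numbers the one-sided test rejects at the 5% level while the two-sided test does not, which shows why the direction of the hypothesis has to be fixed before looking at the data.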

Simplest Test

  • Assume a sample \(x_1,\dots,x_n\) of i.i.d. observations \(X_i\sim N(\mu,\sigma^2)\) with \(n=20\). The sample mean has been calculated as \(\bar{x}=132\).

  • The parameter \(\sigma = 5\) is known.

  • We use \(\alpha=0.05\)

  • \(H_0: \mu\leq 130\) versus \(H_1: \mu>130\)

  • The distribution of the test statistic \(T=\frac{1}{n}\sum_{i=1}^n X_i\) at the boundary \(\mu=\mu_0=130\) of \(H_0\):

    • A sum of independent normally distributed random variables is normally distributed

    • \(E(T) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \mu_0=130\)

    • \(Var(T)=\frac{1}{n^2}\sum_{i=1}^n Var(X_i) = \frac{\sigma^2}{n} = \frac{25}{20} = 1.25\)

Simplest Test

Null hypothesis mean (mu0): 130 
Critical value: 131.839 
Rejection region: T > 131.839 
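
The critical value shown above can be reproduced as the 0.95 quantile of the distribution of the sample mean under \(\mu = \mu_0\). A minimal sketch with the values of this example:

```r
# Critical value for H0: mu <= 130 on the scale of the sample mean
mu0 <- 130; sigma <- 5; n <- 20; alpha <- 0.05
crit <- qnorm(1 - alpha, mean = mu0, sd = sigma / sqrt(n))
round(crit, 3)   # 131.839

xbar <- 132
xbar > crit      # TRUE: the observed mean lies in the rejection region
```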

Calculation

  • Usually, the standardized statistic is used, which is \(N(0,1)\)-distributed for \(\mu=\mu_0\):\[z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.\]
mu0 <- 130  # null hypothesis mean
sigma <- 5  # known standard deviation
n <- 20     # sample size
se <- sigma / sqrt(n)  # standard error
xBar <- 132
# Calculate critical value and z-value
alpha <- 0.05
(t_crit <- qnorm(1 - alpha))
[1] 1.644854
(z <- (xBar - mu0)/se)
[1] 1.788854
#p-value
1 - pnorm(z)
[1] 0.03681914

Steps in Hypothesis Testing

  1. State the hypotheses
  2. Choose the significance level (α)
  3. Select the appropriate test statistic
  4. Calculate the test statistic
  5. Determine the p-value
  6. Make a decision and interpret results

One-Sample t-test

  • Compare a sample mean to a hypothesized population mean
  • Assumption: data are from a normal distribution \(N(\mu,\sigma^2)\), \(\sigma\) unknown
  • Hypotheses:
    • \(H_0: \mu = \mu_0\) versus \(H_1: \mu \neq \mu_0\) (two-tailed)
    • \(H_0: \mu \leq \mu_0\) versus \(H_1: \mu > \mu_0\)
    • \(H_0: \mu \geq \mu_0\) versus \(H_1: \mu < \mu_0\)
  • Test statistic: \[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]
  • Under \(H_0\) (at \(\mu=\mu_0\)): the test statistic \(t\) follows a \(t\)-distribution with \(df = n - 1\) degrees of freedom
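
The test statistic and the two-sided p-value can be computed by hand and checked against t.test(). The data below are simulated and the hypothesized mean of 120 is an arbitrary illustrative value.

```r
# One-sample t statistic computed by hand and checked against t.test()
set.seed(8)
x   <- rnorm(15, mean = 128, sd = 12)   # hypothetical measurements
mu0 <- 120                              # hypothesized mean (assumed)

t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_manual <- 2 * pt(-abs(t_manual), df = length(x) - 1)   # two-sided

res <- t.test(x, mu = mu0)
c(t_manual = t_manual, t_builtin = unname(res$statistic))
c(p_manual = p_manual, p_builtin = res$p.value)
```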

Student’s t-Distribution

  • Symmetric, bell-shaped distribution (like normal distribution)
  • Developed for small sample inference when σ unknown
  • Defined by degrees of freedom (df = n-1)
  • Heavier tails than normal distribution
  • Mean: \(\mu = 0\) (for all df)
  • Variance: \(\sigma^2 = \frac{df}{df-2}\) for df > 2
  • Probability Density Function: \(f(t) = \frac{\Gamma(\frac{df+1}{2})}{\sqrt{df\pi}\,\Gamma(\frac{df}{2})} \left(1 + \frac{t^2}{df}\right)^{-\frac{df+1}{2}}\)

Relationship to Normal Distribution

  • Test statistic:
    • z-statistic (known σ): \(z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\)
    • t-statistic (unknown σ): \(t = \frac{\bar{x} - \mu}{s/\sqrt{n}}\)
  • Key differences:
    • t-distribution has heavier tails
    • Critical values are larger in absolute value than those of the standard normal distribution
  • As df → ∞, t-distribution → standard normal distribution
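
The convergence can be seen numerically by comparing quantiles. A small sketch using the 97.5% quantile (the critical value of a two-sided test at \(\alpha = 0.05\)) for a few arbitrarily chosen degrees of freedom:

```r
# 97.5% quantiles of t-distributions approach the normal quantile
df <- c(2, 5, 10, 30, 100, 1000)
t_crit <- qt(0.975, df)
z_crit <- qnorm(0.975)

round(rbind(df = df, t_crit = t_crit), 3)
round(z_crit, 3)   # 1.96
```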

William Sealy Gosset (1876-1937)

  • Life and Career
    • English statistician and chemist
    • Worked at Guinness Brewery (1899-1937)
    • Published under pseudonym “Student”
    • Collaborated with Karl Pearson and R.A. Fisher
  • Key contributions
    • Student’s t-distribution and t-test (1908)
    • Pioneered small sample statistics
    • Contributions to experimental design
    • Introduced sequential analysis concept

W. S. Gosset

Example: One-Sample t-Test in R

# Test if expected birth weight differs from 3000 grams
t_test_result <- t.test(birthwt$bwt, mu = 3000)

t_test_result

    One Sample t-test

data:  birthwt$bwt
t = -1.0447, df = 188, p-value = 0.2975
alternative hypothesis: true mean is not equal to 3000
95 percent confidence interval:
 2839.952 3049.222
sample estimates:
mean of x 
 2944.587 

Reporting One-Sample t-test Results

“In a sample of 189 infants, the mean birth weight (M = 2944.6g, SD = 729.2) was not significantly different from the hypothesized population mean of 3000g (t(188) = -1.04, p = 0.30, 95% CI [2840.0g, 3049.2g]).”

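Two-Sample t-test (Independent)

  • Use: Compare means between two independent groups
  • Hypotheses:
    • \(H_0: \mu_1 = \mu_2\) versus \(H_1: \mu_1 \neq \mu_2\) (two-tailed)
    • \(H_0: \mu_1\leq \mu_2\) versus \(H_1: \mu_1>\mu_2\)
    • \(H_0: \mu_1\geq\mu_2\) versus \(H_1: \mu_1<\mu_2\)
  • Test statistic (equal variances assumed): \(t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2(\frac{1}{n_1} + \frac{1}{n_2})}}\)
  • Pooled variance: \(s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}\)
  • Degrees of freedom: \(df = n_1 + n_2 - 2\)
  • Without the equal-variance assumption, Welch’s t-test uses \(\sqrt{s_1^2/n_1+s_2^2/n_2}\) in the denominator and approximate (Welch-Satterthwaite) degrees of freedom. This is the default of t.test() in R; var.equal = TRUE gives the pooled version.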
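
A sketch of the pooled calculation on simulated data (group sizes and means are illustrative assumptions), checked against t.test() with var.equal = TRUE and shown next to the Welch version that t.test() uses by default:

```r
# Pooled two-sample t statistic by hand, checked against t.test()
set.seed(9)
x1 <- rnorm(12, mean = 100, sd = 10)   # hypothetical group 1
x2 <- rnorm(15, mean = 92,  sd = 10)   # hypothetical group 2
n1 <- length(x1); n2 <- length(x2)

sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
t_pooled <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))

res_pooled <- t.test(x1, x2, var.equal = TRUE)   # pooled variance
res_welch  <- t.test(x1, x2)                     # default: Welch

c(manual = t_pooled,
  pooled = unname(res_pooled$statistic),
  welch  = unname(res_welch$statistic))
c(df_pooled = unname(res_pooled$parameter),
  df_welch  = unname(res_welch$parameter))
```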

Example: Two-Sample t-test in R

# Compare birth weights between smoking and non-smoking mothers
t.test(bwt ~ smoke, data = birthwt)

    Welch Two Sample t-test

data:  bwt by smoke
t = 2.7299, df = 170.1, p-value = 0.007003
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
  78.57486 488.97860
sample estimates:
mean in group 0 mean in group 1 
       3055.696        2771.919 
# one-sided alternative. Note: the difference is group 0 minus group 1!
t.test(bwt ~ smoke, data = birthwt, alternative="greater")

    Welch Two Sample t-test

data:  bwt by smoke
t = 2.7299, df = 170.1, p-value = 0.003501
alternative hypothesis: true difference in means between group 0 and group 1 is greater than 0
95 percent confidence interval:
 111.8548      Inf
sample estimates:
mean in group 0 mean in group 1 
       3055.696        2771.919 

Reporting Two-Sample t-test Results

“In a study of 189 infants, birth weight was significantly lower for infants of smokers (n = 74, M = 2771.9g, SD = 659.6) compared to non-smokers (n = 115, M = 3055.7g, SD = 752.7), Welch t(170.1) = -2.73, p = 0.007, 95% CI of the difference [-489.0g, -78.6g].”

Origins of the Two-Sample t-test

  • Built upon the foundation of the one-sample t-test
  • Formalized by Ronald A. Fisher in the 1920s as part of his work on statistical methods in research
  • Initially developed for agricultural research, particularly for comparing crop yields under different conditions
  • The test addressed the need to compare means from two independent groups, a common scenario in experimental research
  • Fisher’s work on the two-sample t-test was part of his broader contributions to the field of analysis of variance (ANOVA)
  • The test quickly found applications beyond agriculture, becoming a staple in medical and social science research

Paired t-test

  • Use: Compare means between two samples in which each individual is measured twice (e.g., before and after treatment)
  • The two samples are therefore dependent.
  • All calculations are based on the differences \(D=X_1-X_2\)
  • Hypotheses (two-tailed):
    • \(H_0: \mu_D = 0\)
    • \(H_1: \mu_D \neq 0\)
  • Test statistic:\[t=\frac{\bar{D}}{s_D/\sqrt{n}}\]
  • Degrees of freedom: \(df = n - 1\)
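
Since all calculations use the differences, the paired test is a one-sample t-test on \(D\). A minimal sketch with simulated before/after values (hypothetical numbers):

```r
# A paired t-test is a one-sample t-test on the differences
set.seed(10)
before <- rnorm(12, mean = 140, sd = 10)          # hypothetical values
after  <- before + rnorm(12, mean = -4, sd = 5)
d <- before - after

t_manual   <- mean(d) / (sd(d) / sqrt(length(d)))
res_paired <- t.test(before, after, paired = TRUE)
res_diff   <- t.test(d, mu = 0)

c(manual     = t_manual,
  paired     = unname(res_paired$statistic),
  one_sample = unname(res_diff$statistic))
```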

Paired t-Test in R

# Simulate paired data (e.g., blood pressure before and after treatment)
set.seed(123)
bp_before <- rnorm(30, mean = 140, sd = 10)
bp_after <- bp_before + rnorm(30, mean = -5, sd = 5)

# Perform paired t-test
t.test(bp_before, bp_after, paired = TRUE, alternative = "greater")

    Paired t-test

data:  bp_before and bp_after
t = 5.3889, df = 29, p-value = 4.305e-06
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 2.812955      Inf
sample estimates:
mean difference 
       4.108308 

Reporting Paired t-test Results

“In a study of 30 patients, blood pressure was significantly lower after treatment (mean reduction M = 4.1 mmHg, SD = 4.2), t(29) = 5.39, one-sided p < 0.001, one-sided 95% CI [2.8, ∞).”

Statistical Power

  • Definition: Probability of correctly rejecting a false null hypothesis
  • Factors affecting power:
    1. Sample size (n)
    2. Effect size (d)
    3. Significance level (α)
    4. Variability in the data (σ)
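  • Each test has its own formula. Two-sample t-test (normal approximation, two-sided, \(n\) per group): \[n = \frac{2(z_{\alpha/2} + z_{\beta})^2 \sigma^2}{\Delta^2}\]where \(\Delta=\mu_1-\mu_2\) is the difference to be detected and \(z_q\) is the upper \(q\) quantile of the standard normal distribution
  • The standardized effect size \(d\) also depends on the test. Two-sample: \(d=\Delta/\sigma\). One-sample t-test (Cohen): \[d=\frac{\bar{x}-\mu_0}{s}\]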
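
As a sketch, power and sample size for a two-sample t-test can be computed with power.t.test() from base R's stats package and compared with the normal-approximation formula. The standardized effect size of 0.5 is an assumed value.

```r
# Power sketch with base R's power.t.test() and the normal approximation
alpha <- 0.05
d     <- 0.5   # standardized effect size (difference / sd), assumed

# power of a two-sample t-test with 30 observations per group
pw <- power.t.test(n = 30, delta = d, sd = 1, sig.level = alpha)$power

# required n per group for 80% power: t-based and z-based approximation
n_exact  <- power.t.test(power = 0.8, delta = d, sd = 1, sig.level = alpha)$n
n_approx <- 2 * (qnorm(1 - alpha / 2) + qnorm(0.8))^2 / d^2

c(power = pw, n_exact = n_exact, n_approx = n_approx)
```

The normal approximation gives a slightly smaller sample size than the t-based calculation because it ignores the uncertainty in the estimated standard deviation.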

Example: Power Analysis in R

library(pwr)
# Calculate power for two-sample t-test.
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05)

     Two-sample t test power calculation 

              n = 30
              d = 0.5
      sig.level = 0.05
          power = 0.4778965
    alternative = two.sided

NOTE: n is number in *each* group
# Calculate n, necessary to reach some power
pwr.t.test(power = 0.8, d = 0.5, sig.level = 0.05)

     Two-sample t test power calculation 

              n = 63.76561
              d = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

Tests for Proportions

  • The binomial test compares an observed proportion of a binary outcome (“success”, “failure”) with a hypothesized value.

  • It assumes that the underlying distribution is a binomial distribution with \(n\) trials and probability of success \(p\).

  • Hypotheses are then formulated about \(p\).

  • Let’s test if the proportion of low birthweight babies in our sample exceeds 10%: \(\text{H}_0: p\leq 0.10 \text{ versus H}_1: p>0.10\)

# Test proportion of low birthweight babies
binom.test(~(low==1), data=birthwt, p=0.10, alternative = "greater")



data:  birthwt$(low == 1)  [with success = TRUE]
number of successes = 59, number of trials = 189, p-value = 8.405e-16
alternative hypothesis: true probability of success is greater than 0.1
95 percent confidence interval:
 0.2565953 1.0000000
sample estimates:
probability of success 
             0.3121693 

Tests for Proportions

In this dataset from the Baystate Medical Center in Springfield, Massachusetts, we observe a much higher rate of low birthweight babies than 10%. This is because this was a targeted study of risk factors for low birthweight, not a representative population sample.

Tests for Proportions

  • It is also possible to test whether the proportions of one variable differ between groups defined by another variable
tally(low ~ smoke,data=birthwt)
   smoke
low  0  1
  0 86 44
  1 29 30
prop.test(low ~ smoke, data = birthwt, 
          alternative = "less",
          success = 1)

    2-sample test for equality of proportions with continuity correction

data:  tally(low ~ smoke)
X-squared = 4.2359, df = 1, p-value = 0.01979
alternative hypothesis: less
95 percent confidence interval:
 -1.00000000 -0.02701885
sample estimates:
   prop 1    prop 2 
0.2521739 0.4054054 

Chi-square Test of Independence

  • Use: Test association between two categorical variables
  • Hypotheses:
    • \(H_0\): No association between variables
    • \(H_1\): Association exists between variables
  • Test statistic: \(\chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)
  • Degrees of freedom: \(df = (r-1)(c-1)\), where \(r\) = number of rows, \(c\) = number of columns
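  • Under \(H_0\): the test statistic is approximately \(\chi^2\)-distributed with \(df\) degrees of freedom (for sufficiently large expected counts \(E_{ij}\))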
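
The statistic can be computed by hand from observed and expected counts. The sketch below uses the 2×2 table of low birth weight by smoking status from the birthwt data, entered manually, and omits the continuity correction so that it matches the formula above.

```r
# Chi-square statistic by hand for the low birth weight x smoking table
tab <- matrix(c(86, 29, 44, 30), nrow = 2,
              dimnames = list(low = c("0", "1"), smoke = c("0", "1")))

expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # E_ij under H0
chi2 <- sum((tab - expected)^2 / expected)                 # no correction
df   <- (nrow(tab) - 1) * (ncol(tab) - 1)
p    <- pchisq(chi2, df = df, lower.tail = FALSE)

c(chi2 = chi2, df = df, p = p)
unname(chisq.test(tab, correct = FALSE)$statistic)   # same value as chi2
```

For 2×2 tables, chisq.test() applies Yates' continuity correction by default, which gives a somewhat smaller statistic than the uncorrected formula.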

Example: Chi-square Test in R

# frequency table
(tab <- table(birthwt$low, birthwt$smoke))
   
     0  1
  0 86 44
  1 29 30
# Test association between low birth weight and smoking
chisq.test(tab)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 4.2359, df = 1, p-value = 0.03958

Reporting Chi-square Test Results

Example: “In a sample of 189 mother-infant pairs, there was a significant association between low birth weight and maternal smoking status, χ²(1, N = 189) = 4.24, p = 0.040 (with Yates’ continuity correction). The odds of having a low birth weight baby were 2.02 times higher for smokers compared to non-smokers (Wald 95% CI [1.08, 3.78]).”

Origins of the Chi-square Test

  • Developed by Karl Pearson in 1900
  • Introduced in his paper “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling”
  • Pearson was addressing problems in biology and evolution, inspired by questions raised by Walter Frank Raphael Weldon
  • Originally used to test goodness of fit between observed and theoretical distributions
  • The test was a breakthrough in allowing researchers to quantify the agreement between observed data and a hypothesized distribution
  • Later extended to test independence between categorical variables
  • It became a cornerstone of contingency table analysis, widely used in medical research for analyzing categorical data

Nonparametric Tests

Nonparametric Statistics

  • What is Nonparametric Statistics?
    • Statistical methods that do not assume a specific parametric family (e.g., normal) for the underlying distribution of the data
    • Often based on ranks of the data rather than the actual values
    • Generally more robust than parametric tests, but less powerful when the parametric assumptions are met
  • When is it useful?
    • Data do not follow a normal distribution
    • Ordinal data or unclear scaling
    • Outliers that might distort parametric results
    • Small samples, where distributional assumptions are difficult to verify

Choosing Between Parametric and Nonparametric Tests

  • Consider the nature of your data (continuous, ordinal, nominal)
  • Check the assumptions of parametric tests (normality, homogeneity of variance)
  • Evaluate sample size and presence of outliers
  • Consider the research question and desired power of the analysis

Mann-Whitney U Test (Wilcoxon Rank-Sum Test)

  • Nonparametric alternative to the independent two-sample t-test
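  • Often described as comparing the medians of two independent groups; strictly, it tests whether values in one group tend to be larger than in the other (a location shift if both distributions have the same shape)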
  • Based on the ranks of the observations across both groups

Example: Mann-Whitney U Test in R

# Perform Mann-Whitney U test on birth weight between smokers and non-smokers
wilcox.test(bwt ~ smoke, data = birthwt)

    Wilcoxon rank sum test with continuity correction

data:  bwt by smoke
W = 5249.5, p-value = 0.006768
alternative hypothesis: true location shift is not equal to 0

Reporting Mann-Whitney U Test Results

Example: “A Mann-Whitney U test revealed a significant difference in birth weights between infants of smokers and non-smokers (W = 5249.5, p = 0.007).”

Wilcoxon Signed-Rank Test

  • Nonparametric alternative to the paired t-test
  • Can be also applied to the one-sample case
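  • No normality assumption, but the differences are assumed to be symmetrically distributed around their median
  • Hypotheses about the median (of the differences in the paired case)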
  • Used to compare a sample median to a hypothesized value or to compare two related samples
  • Based on the ranks of the (absolute) differences between pairs of observations
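
A sketch of the statistic on simulated paired data (hypothetical values): V, as reported by wilcox.test(), is the sum of the ranks of the absolute differences that belong to positive differences.

```r
# Signed-rank statistic V by hand, checked against wilcox.test()
set.seed(12)
before <- rnorm(15, mean = 140, sd = 10)         # hypothetical values
after  <- before + rnorm(15, mean = -5, sd = 5)
d <- before - after

r <- rank(abs(d))           # ranks of the absolute differences
V_manual <- sum(r[d > 0])   # rank sum of the positive differences

res <- wilcox.test(before, after, paired = TRUE)
c(V_manual = V_manual, V_builtin = unname(res$statistic))
```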

Wilcoxon Signed-Rank Test in R

# Simulating blood pressure data before and after treatment
set.seed(123)
bp_before <- rnorm(30, mean = 140, sd = 10)
bp_after <- bp_before + rnorm(30, mean = -5, sd = 5)

# Performing Wilcoxon Signed-Rank Test
wilcox.test(bp_before, bp_after, paired = TRUE)

    Wilcoxon signed rank exact test

data:  bp_before and bp_after
V = 425, p-value = 1.598e-05
alternative hypothesis: true location shift is not equal to 0

Reporting Wilcoxon Signed-Rank Test Results

Example: “A Wilcoxon signed-rank test indicated that the median reduction in blood pressure after treatment was statistically significant, V = 425, p < 0.001.”

Multiple Testing Problem

Introduction to Multiple Testing

  • Issue: Increased likelihood of Type I errors when performing multiple tests
  • Family-wise error rate (FWER): Probability of making at least one Type I error in a set of tests
  • False Discovery Rate (FDR): Expected proportion of false discoveries among all discoveries
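
How quickly the FWER grows can be computed directly under the simplifying assumption of \(m\) independent tests with all null hypotheses true. The sketch also shows the effect of testing each hypothesis at level \(\alpha/m\) instead of \(\alpha\).

```r
# FWER for m independent tests when all null hypotheses are true
alpha <- 0.05
m     <- c(1, 5, 10, 20, 100)

fwer_uncorrected <- 1 - (1 - alpha)^m       # each test at level alpha
fwer_bonferroni  <- 1 - (1 - alpha / m)^m   # each test at level alpha / m

round(rbind(m, fwer_uncorrected, fwer_bonferroni), 3)
```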

Bonferroni Correction

  • Simplest and most conservative approach
  • Adjusted significance level: \(\alpha_{adjusted} = \alpha / m\), where \(m\) is the number of tests
  • Pros: Easy to implement and understand
  • Cons: Can be overly conservative, especially for large numbers of tests

Holm-Bonferroni Method

  1. Order the p-values from smallest to largest: \(p_{(1)}, p_{(2)}, ..., p_{(m)}\)
  2. For each \(p_{(i)}\), compare with \(\alpha / (m - i + 1)\)
  3. Find the first \(k\) such that \(p_{(k)} > \alpha / (m - k + 1)\)
  4. Reject null hypotheses for tests 1 to \(k-1\); do not reject the null hypotheses for tests \(k\) to \(m\). If no such \(k\) exists, reject all null hypotheses.
  • Pros: More powerful than Bonferroni, while still controlling FWER
  • Cons: Can still be conservative for very large numbers of tests
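
The four steps can be sketched in a few lines and compared with the decisions implied by p.adjust(). The p-values are hypothetical.

```r
# Holm-Bonferroni step-down procedure by hand (hypothetical p-values)
p     <- c(0.030, 0.001, 0.045, 0.010, 0.005)
alpha <- 0.05
m     <- length(p)

ord       <- order(p)                        # step 1: sort p-values
threshold <- alpha / (m - seq_len(m) + 1)    # step 2: alpha / (m - i + 1)
fails     <- p[ord] > threshold
k         <- if (any(fails)) which(fails)[1] else m + 1   # step 3

reject <- logical(m)
reject[ord[seq_len(k - 1)]] <- TRUE          # step 4: reject tests 1..k-1

reject
p.adjust(p, method = "holm") <= alpha        # same decisions
```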

False Discovery Rate (FDR) Control

  • Less conservative approach focused on controlling the proportion of false positives
  • Benjamini-Hochberg procedure is a common method for FDR control
  • Pros: More powerful for large-scale testing (e.g., genomics)
  • Cons: Does not provide strong control of the FWER
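
A sketch of the Benjamini-Hochberg step-up rule, which rejects the \(k\) smallest p-values, where \(k\) is the largest index with \(p_{(k)} \leq \frac{k}{m}\alpha\). The p-values are hypothetical, and the result is compared with p.adjust().

```r
# Benjamini-Hochberg step-up procedure by hand (hypothetical p-values)
p     <- c(0.030, 0.001, 0.045, 0.010, 0.005, 0.200, 0.600)
alpha <- 0.05
m     <- length(p)

ord  <- order(p)
pass <- p[ord] <= seq_len(m) / m * alpha   # compare p_(i) with (i / m) * alpha
k    <- if (any(pass)) max(which(pass)) else 0

reject <- logical(m)
reject[ord[seq_len(k)]] <- TRUE            # reject the k smallest p-values

reject
p.adjust(p, method = "BH") <= alpha        # same decisions
```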

Example: Multiple Testing Corrections in R

# p-values from five tests (illustrative values)
(p_values <- c(0.001,0.005,0.01,0.03,0.045))
[1] 0.001 0.005 0.010 0.030 0.045
# Apply different corrections
p.adjust(p_values, method = "bonferroni")
[1] 0.005 0.025 0.050 0.150 0.225
p.adjust(p_values, method = "holm")
[1] 0.005 0.020 0.030 0.060 0.060
p.adjust(p_values, method = "fdr")
[1] 0.00500000 0.01250000 0.01666667 0.03750000 0.04500000