Statistical hypothesis testing is a fundamental concept in data analysis that helps us make informed decisions based on sample data. Traditionally, hypothesis tests are classified into two broad categories: parametric and non-parametric tests.

In this document, we will:

- Recap key parametric tests used in hypothesis testing.
- Understand their limitations and when they may not be appropriate.
- Learn why non-parametric tests are valuable and when to use them.
A parametric test is a statistical test that assumes the data follows a specific probability distribution (e.g., the normal distribution). These tests estimate parameters (such as the mean or variance) and rely on assumptions about the population.
Here are some commonly used parametric tests:
Test | Purpose | Example Use Case |
---|---|---|
Z-Test | Tests whether the mean of a sample differs from a known population mean when population variance is known. | Checking if the average IQ of students in a school is 100. |
t-Test | Compares means between one or two samples when population variance is unknown. | Comparing exam scores between students who took online vs. offline classes. |
ANOVA (F-Test) | Tests for differences among more than two group means. | Comparing salaries across different industries. |
Pearson Correlation | Measures linear association between two variables. | Checking the relationship between height and weight. |
Linear Regression | Models the relationship between one or more predictors and an outcome. | Predicting house prices based on size, location, etc. |
Consider a one-sample t-test, which tests if the mean of a sample differs from a hypothesized mean \(\mu_0\):
The test statistic is:
\[ t = \frac{\bar{X} - \mu_0}{\frac{S}{\sqrt{n}}} \]
where:

- \(\bar{X}\) = Sample mean
- \(S\) = Sample standard deviation
- \(n\) = Sample size
Under \(H_0\), the test statistic follows a t-distribution with \(n-1\) degrees of freedom.
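To make the formula concrete, here is a minimal sketch (using simulated data, with the sample parameters and \(\mu_0\) chosen purely for illustration) that computes the t statistic by hand and verifies it against R's built-in `t.test()`:

# Minimal sketch: compute the one-sample t statistic manually
# (simulated data; mean, sd, and mu0 are hypothetical values)
set.seed(1)
x <- rnorm(25, mean = 102, sd = 15)  # hypothetical sample
mu0 <- 100                           # hypothesized mean

t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t_builtin <- unname(t.test(x, mu = mu0)$statistic)

c(manual = t_manual, builtin = t_builtin)  # both values should match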
Despite their usefulness, parametric tests come with strict assumptions:

- Normality: the data (or model residuals) should follow a normal distribution.
- Homogeneity of variance: groups being compared should have similar variances.
- Independence: observations should not influence one another.
- Measurement scale: the data should be on an interval or ratio scale.

When these assumptions are violated, the results of parametric tests can be misleading. This is where non-parametric tests come to the rescue!
A non-parametric test does not assume a specific distribution for the data. Instead, it relies on ranks or medians, making it robust to non-normality, small samples, and outliers.
✅ No Normality Assumption: Works well with skewed data or ordinal data.

✅ Handles Outliers: Since it is based on ranks, extreme values do not distort results.

✅ Small Sample Friendly: Does not require large sample sizes to be reliable.

✅ Works with Ordinal Data: Useful for surveys, customer ratings, or Likert scales (1-5).
Non-Parametric Test | Alternative To | Purpose |
---|---|---|
Wilcoxon Signed-Rank | One-sample t-test | Tests whether the median of a single sample differs from a known value. |
Mann-Whitney U Test | Independent t-test | Compares two independent groups when normality is violated. |
Kruskal-Wallis Test | One-way ANOVA | Compares three or more groups when normality is violated. |
Spearman’s Rank Correlation | Pearson Correlation | Measures monotonic relationships (not necessarily linear). |
Friedman Test | Repeated Measures ANOVA | Compares multiple paired samples when normality is violated. |
# Set global chunk options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

# Load required packages
library(ggplot2)
library(dplyr)
library(knitr)
library(car)
library(ggpubr)
# Load the dataset
data("iris")
# Display the first few rows
kable(head(iris), caption = "Sample of Iris Dataset")
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
# Load dataset
data("mtcars")
# Convert 'am' to a factor for readability
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
# Display first few rows
kable(head(mtcars), caption = "First Six Rows of mtcars Dataset")
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | Manual | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | Manual | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | Manual | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | Automatic | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | Automatic | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | Automatic | 3 | 1 |
# Load the dataset
data("faithful")
# Display the first few rows
head(faithful)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
geom_histogram(bins = 15, alpha = 0.6, position = "identity") +
facet_wrap(~Species) +
labs(title = "Histogram of Petal Length by Species", x = "Petal Length", y = "Count") +
theme_minimal()
ggplot(mtcars, aes(x = mpg, fill = am)) +
geom_histogram(bins = 10, alpha = 0.6, position = "identity") +
facet_wrap(~am) +
labs(title = "Histogram of MPG for Automatic vs. Manual Cars",
x = "Miles Per Gallon (MPG)", y = "Count") +
theme_minimal()
Automatic cars show a nearly normal distribution.
Manual cars exhibit right skewness, which may violate the assumption of normality.
# Histogram of waiting times
ggplot(faithful, aes(x = waiting)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
labs(title = "Histogram of Waiting Times Between Eruptions",
x = "Waiting Time (minutes)",
y = "Frequency") +
theme_minimal()
ggqqplot(iris, x = "Petal.Length", facet.by = "Species", color = "Species") +
labs(title = "Q-Q Plot of Petal Length by Species")
ggqqplot(mtcars, x = "mpg", facet.by = "am", color = "am") +
labs(title = "Q-Q Plot of MPG by Transmission Type")
# Q-Q plot
ggplot(faithful, aes(sample = waiting)) +
stat_qq() +
stat_qq_line() +
labs(title = "Q-Q Plot of Waiting Times",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_minimal()
# Apply Shapiro-Wilk test separately for each species
shapiro_results <- iris %>%
group_by(Species) %>%
summarise(p_value = shapiro.test(Petal.Length)$p.value)
print(shapiro_results)
## # A tibble: 3 × 2
## Species p_value
## <fct> <dbl>
## 1 setosa 0.0548
## 2 versicolor 0.158
## 3 virginica 0.110
The Shapiro-Wilk test checks whether data follow a normal distribution. The null hypothesis (H0) assumes normality, and we reject H0 if the p-value is less than 0.05.
Species | p-value | Decision (α = 0.05) | Interpretation |
---|---|---|---|
Setosa | 0.0548 | Fail to Reject H0 | Data appear to follow a normal distribution. |
Versicolor | 0.1585 | Fail to Reject H0 | Data appear to follow a normal distribution. |
Virginica | 0.1098 | Fail to Reject H0 | Data appear to follow a normal distribution. |
Since all p-values are greater than 0.05, we fail to reject the null hypothesis for all three species. This suggests that the petal length data are not significantly different from a normal distribution.
✅ Yes! The normality assumption holds for all species.
Since all species pass the Shapiro-Wilk test, using parametric tests like ANOVA is appropriate.
However, for small samples (n < 30), normality tests can be less reliable. Always combine statistical tests with visual methods (e.g., Q-Q plots, histograms) before making final conclusions.
🔍 Final Answer: Normality assumption is OK. Parametric tests are justified. 🚀
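As a follow-up sketch (not part of the original analysis), a one-way ANOVA on petal length could look like the code below. Since ANOVA also assumes homogeneity of variance, `leveneTest()` from the already-loaded car package is included as an additional check:

# Check homogeneity of variance across species (car package)
leveneTest(Petal.Length ~ Species, data = iris)

# One-way ANOVA: does mean petal length differ by species?
iris_anova <- aov(Petal.Length ~ Species, data = iris)
summary(iris_anova)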
# Apply Shapiro-Wilk test separately for each transmission type
shapiro_results <- mtcars %>%
group_by(am) %>%
summarise(p_value = shapiro.test(mpg)$p.value)
print(shapiro_results)
## # A tibble: 2 × 2
## am p_value
## <fct> <dbl>
## 1 Automatic 0.899
## 2 Manual 0.536
# Shapiro-Wilk test
shapiro_test <- shapiro.test(faithful$waiting)
shapiro_test
##
## Shapiro-Wilk normality test
##
## data: faithful$waiting
## W = 0.92215, p-value = 1.015e-10
The Shapiro-Wilk test assesses whether the waiting times in the `faithful` dataset follow a normal distribution.
Since the p-value is less than 0.05, we reject H0.
This means the waiting time data are not normally distributed; indeed, the histogram above shows a distinctly bimodal shape. The normality assumption is violated, making parametric tests inappropriate.
❌ Parametric tests (e.g., t-test, ANOVA) should not be used.
✅ Non-parametric tests (e.g., Wilcoxon rank-sum test, Kruskal-Wallis test) should be applied instead.
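For instance, a one-sample Wilcoxon signed-rank test could compare the median waiting time against a reference value. The sketch below uses mu = 70 minutes, a hypothetical value chosen purely for illustration:

# Non-parametric one-sample test on the median waiting time
# (mu = 70 is a hypothetical reference value, not from the analysis above)
wilcox.test(faithful$waiting, mu = 70)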
# Simulated exam scores
set.seed(42)
exam_scores <- rnorm(30, mean = 75, sd = 10)
# Perform One-Sample t-Test
t_test_one_sample <- t.test(exam_scores, mu = 75)
# Print test result
t_test_one_sample
##
## One Sample t-test
##
## data: exam_scores
## t = 0.29933, df = 29, p-value = 0.7668
## alternative hypothesis: true mean is not equal to 75
## 95 percent confidence interval:
## 70.99952 80.37222
## sample estimates:
## mean of x
## 75.68587
If the p-value is less than 0.05, we reject H0 and conclude that the mean exam score is significantly different from 75.

If the p-value is greater than or equal to 0.05, we fail to reject H0: there is no statistically significant difference between the mean exam score and 75. With the simulated data above, p = 0.7668, so we fail to reject H0.
# Simulated teaching method data
traditional <- rnorm(15, mean = 75, sd = 10)
online <- rnorm(15, mean = 78, sd = 9)
# Perform Two-Sample t-Test (var.equal = TRUE assumes equal group variances)
t_test_two_sample <- t.test(traditional, online, var.equal = TRUE)
# Print test result
t_test_two_sample
##
## Two Sample t-test
##
## data: traditional and online
## t = -2.0235, df = 28, p-value = 0.05266
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.69936545 0.08973592
## sample estimates:
## mean of x mean of y
## 71.57939 78.88420
# Simulated before-after weight data
before_diet <- rnorm(20, mean = 80, sd = 5)
after_diet <- before_diet - rnorm(20, mean = 2, sd = 1) # Expected weight loss
# Perform Paired t-Test
t_test_paired <- t.test(before_diet, after_diet, paired = TRUE)
# Print test result
t_test_paired
##
## Paired t-test
##
## data: before_diet and after_diet
## t = 10.142, df = 19, p-value = 4.188e-09
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 1.652319 2.511593
## sample estimates:
## mean difference
## 2.081956
# Simulated exam scores for 3 groups
set.seed(42)
method1 <- rnorm(15, mean = 70, sd = 10)
method2 <- rnorm(15, mean = 75, sd = 9)
method3 <- rnorm(15, mean = 78, sd = 8)
# Create data frame
df_anova <- data.frame(
Method = rep(c("Traditional", "Online", "Hybrid"), each = 15),
Score = c(method1, method2, method3)
)
# Perform ANOVA
anova_test <- aov(Score ~ Method, data = df_anova)
# Print summary
summary(anova_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## Method 2 102 51.11 0.482 0.621
## Residuals 42 4450 105.95
If the p-value is less than 0.05, we reject H0 (the hypothesis that all group means are equal) and conclude that there is a statistically significant difference in mean exam scores between at least one pair of teaching methods.

Note that ANOVA does not tell us *which* teaching methods differ, only that at least one pair does; a post-hoc test is needed to identify the specific pairs.
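A common choice for this follow-up (not run in the original analysis) is Tukey's HSD test, sketched below. With the simulated data above the ANOVA is not significant, so this is shown purely to illustrate the workflow:

# Post-hoc pairwise comparisons between teaching methods
TukeyHSD(anova_test)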
Test Type | Parametric Test | Non-Parametric Equivalent |
---|---|---|
One Sample Test | One-Sample t-Test | Wilcoxon Signed-Rank Test |
Two Sample Test | Independent t-Test | Mann-Whitney U Test |
Paired Sample Test | Paired t-Test | Wilcoxon Signed-Rank Test |
Multiple Groups | One-Way ANOVA | Kruskal-Wallis Test |
Association | Pearson Correlation | Spearman Rank Correlation |
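To illustrate this correspondence (a sketch, not part of the original analysis), both members of a pair can be run on the same data; here, the mpg-by-transmission comparison from the mtcars example:

# Parametric: independent two-sample t-test
t.test(mpg ~ am, data = mtcars)

# Non-parametric equivalent: Mann-Whitney U (Wilcoxon rank-sum) test
wilcox.test(mpg ~ am, data = mtcars)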
A researcher collects final exam scores from 3 groups of students, each of which used a different study technique.
# Simulated data
technique1 <- rnorm(12, mean = 70, sd = 10)
technique2 <- rnorm(12, mean = 75, sd = 12)
technique3 <- rnorm(12, mean = 78, sd = 8)
df_exercise <- data.frame(
Technique = rep(c("Flashcards", "Practice Tests", "Summarization"), each = 12),
Score = c(technique1, technique2, technique3)
)
# Check normality
shapiro.test(df_exercise$Score)
##
## Shapiro-Wilk normality test
##
## data: df_exercise$Score
## W = 0.91258, p-value = 0.00767
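Since the Shapiro-Wilk test rejects normality (p = 0.00767 < 0.05), a non-parametric comparison is appropriate here. A sketch of the natural next step, the Kruskal-Wallis test:

# Normality is violated, so use Kruskal-Wallis instead of one-way ANOVA
kruskal.test(Score ~ Technique, data = df_exercise)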
This document provides a detailed recap of parametric hypothesis tests and explains why non-parametric alternatives are important. 🚀
The roots of nonparametric statistics lie in early probability theory and empirical methods. A frequently cited early example is John Arbuthnot's 1710 analysis of London birth records, which used what is essentially a sign test, long before statistical hypothesis testing was formalized.
As statisticians recognized the limitations of parametric tests, new nonparametric methods were developed. Important milestones include Spearman's rank correlation (1904), the Friedman test (1937), the Wilcoxon signed-rank and rank-sum tests (1945), the Mann-Whitney U test (1947), and the Kruskal-Wallis test (1952).
With the advent of computers, nonparametric methods have become more computationally feasible, leading to the development of advanced resampling techniques such as the jackknife (Quenouille, 1949; Tukey, 1958), permutation tests, and the bootstrap (Efron, 1979).