This document demonstrates various parametric and non-parametric tests in R, covering the assumptions each test makes, its test statistic, its confidence interval, and a plot of its critical region, for both one-sample and two-sample scenarios as well as one-way ANOVA.
Random Sampling: The data should be obtained through random sampling.
Independence: The observations should be independent.
Normality: The population or the sampling distribution should be approximately normally distributed.
Standard Deviation: The population standard deviation should be known.
R tests for assumptions:
Normality: QQ plot, Chi-square, Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov (non-parametric) test, etc. The known-standard-deviation assumption itself cannot be tested from the sample; it must come from prior knowledge.
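As a quick illustration, here is a minimal sketch of a normality check for the sample used in the z-test example below:
# Check normality (sample reused from the z-test example below)
data <- c(28.5, 29.8, 30.2, 31.0, 29.3)
shapiro.test(data)          # a large p-value gives no evidence against normality
qqnorm(data); qqline(data)  # points close to the line suggest normality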
The test statistic for the one-sample Z-test is calculated as:
\[ Z = \frac{{\bar{X} - \mu_0}}{{\sigma / \sqrt{n}}} \]
Where:
\(\bar{X}\) is the sample mean.
\(\mu_0\) is the hypothesized population mean.
\(\sigma\) is the population standard deviation.
\(n\) is the sample size.
The Z-test is used to test a population mean when the population standard deviation is known. The null and alternative hypotheses are:
\[ H_0: \mu = \mu_0 \quad \text{vs.} \quad H_1: \mu \neq \mu_0 \]
To perform a one-sample Z-test in R, you can use the following code:
# Load the library
library("BSDA")
# Sample data and population mean
data <- c(28.5, 29.8, 30.2, 31.0, 29.3)
mu_0 <- 30.0
# One-sample Z-test (sigma.x is the assumed known population standard deviation)
z_test_result <- z.test(data, mu = mu_0, sigma.x = 1/sqrt(5))
z_test_result
##
## One-sample z-Test
##
## data: data
## z = -1.2, p-value = 0.2301
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
## 29.36801 30.15199
## sample estimates:
## mean of x
## 29.76
The confidence interval for the one-sample Z-test is calculated as:
\[ \left(\bar{X} - Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{X} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right) \]
Where:
\(\bar{X}\) is the sample mean.
\(Z_{\alpha/2}\) is the critical value from the standard normal distribution corresponding to the significance level \(\alpha\) (confidence level \(1 - \alpha\)).
\(\sigma\) is the population standard deviation.
\(n\) is the sample size.
The confidence interval for the sample data is:
# Confidence interval for one-sample Z-test
z_test_result$conf.int
## [1] 29.36801 30.15199
## attr(,"conf.level")
## [1] 0.95
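As a sanity check, the same interval can be reproduced directly from the formula above, using the sigma assumed in the z.test call:
# Reproduce the z confidence interval by hand
n <- length(data)
sigma <- 1 / sqrt(5)                         # the known sigma assumed in z.test above
se <- sigma / sqrt(n)                        # 0.2
mean(data) + c(-1, 1) * qnorm(0.975) * se    # 29.36801 30.15199, as reported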
To plot the critical region for the Z-test:
# Plot the critical region (normal curve centered at mu_0 with the sample SD)
x <- seq(26.5, 33.5, 0.1)
y <- dnorm(x, mean = mu_0, sd = sd(data))
cr_left <- subset(data.frame(x, y), x < qnorm(0.025, mean = mu_0, sd = sd(data)))
cr_right <- subset(data.frame(x, y), x > qnorm(0.975, mean = mu_0, sd = sd(data)))
plot(x, y, type = "l", lwd = 2, col = "blue", xlab = "Sample Mean", ylab = "Density",
main = "Critical Region for Z-Test")
abline(h=0)
# Shade both tails; rev() avoids hard-coded row indices
polygon(c(cr_left$x, rev(cr_left$x)), c(cr_left$y, rep(0, nrow(cr_left))), col = "red")
polygon(c(cr_right$x, rev(cr_right$x)), c(cr_right$y, rep(0, nrow(cr_right))), col = "red")
Random Sampling: The data should be obtained through random sampling.
Independence: The observations should be independent.
Normality: The population or the sampling distribution should be approximately normally distributed.
Standard Deviation: The population standard deviation must be unknown.
R tests for assumptions:
Normality: QQ plot, Chi-square, Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov (non-parametric) test, etc.
The test statistic for the one-sample t-test is calculated as:
\[ t = \frac{{\bar{X} - \mu_0}}{{s / \sqrt{n}}} \]
Where:
\(\bar{X}\) is the sample mean.
\(\mu_0\) is the hypothesized population mean.
\(s\) is the sample standard deviation.
\(n\) is the sample size.
The t-test is used when the population standard deviation is unknown. The null and alternative hypotheses are the same as for the Z-test. To perform a one-sample t-test in R:
# Sample data
data <- c(28.5, 29.8, 30.2, 31.0, 29.3)
# One-sample t-test
t_test_result <- t.test(data, mu = mu_0)
t_test_result
##
## One Sample t-test
##
## data: data
## t = -0.5711, df = 4, p-value = 0.5985
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
## 28.59323 30.92677
## sample estimates:
## mean of x
## 29.76
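The reported statistic can be reproduced directly from the formula above:
# Reproduce the t statistic by hand
(mean(data) - mu_0) / (sd(data) / sqrt(length(data)))   # -0.5711, matching t.test above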
The confidence interval for the one-sample t-test is calculated as:
\[ \left(\bar{X} - t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}, \bar{X} + t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}\right) \]
Where:
\(\bar{X}\) is the sample mean.
\(t_{\alpha/2, n-1}\) is the critical value from the t-distribution with \(n-1\) degrees of freedom corresponding to the significance level \(\alpha\) (confidence level \(1 - \alpha\)).
\(s\) is the sample standard deviation.
\(n\) is the sample size.
The confidence interval for the sample data is:
# Confidence interval for one-sample t-test
t_test_result$conf.int
## [1] 28.59323 30.92677
## attr(,"conf.level")
## [1] 0.95
To plot the critical region for the t-test:
# Plot the critical region of the t distribution with 4 degrees of freedom
x <- seq(-6, 6, 0.1)
y <- dt(x, df = 4)
cr_left <- subset(data.frame(x, y), x < qt(0.025, df = 4))
cr_right <- subset(data.frame(x, y), x > qt(0.975, df = 4))
plot(x, y, type = "l", lwd = 2, col = "blue", xlab = "t Statistic", ylab = "Density",
main = "Critical Region for T-Test")
abline(h=0)
polygon(c(cr_left$x, rev(cr_left$x)), c(cr_left$y, rep(0, nrow(cr_left))), col = "red")
polygon(c(cr_right$x, rev(cr_right$x)), c(cr_right$y, rep(0, nrow(cr_right))), col = "red")
Random Sampling: The data should be obtained through random sampling.
Independence: The observations should be independent.
Expected Frequencies: The expected frequency in each cell of the contingency table should be sufficiently large (a common rule of thumb is at least 5, and never below 1).
R test for the assumption: the expected cell counts can be inspected via chisq.test(observed)$expected, as shown after the test output below.
The test statistic for the chi-square test is calculated as:
\[ \chi^2 = \sum \frac{{(O - E)^2}}{{E}} \]
Where:
\(O\) is the observed frequency.
\(E\) is the expected frequency.
The chi-square test is used to test the independence of two categorical variables. The null and alternative hypotheses are:
H0: The variables are independent
H1: The variables are not independent
To perform a chi-square test in R:
# Contingency table
observed <- matrix(c(10, 20, 30, 40), nrow = 2)
# Chi-square test
chi_square_result <- chisq.test(observed)
chi_square_result
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: observed
## X-squared = 0.44643, df = 1, p-value = 0.504
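Note that chisq.test applies Yates' continuity correction to 2x2 tables by default, so the reported X-squared is smaller than the value the formula above gives. A sketch that checks the expected counts and reproduces the uncorrected statistic:
# Expected counts (the assumption check) and the uncorrected statistic
E <- chisq.test(observed)$expected
E                                        # all expected counts here exceed 5
sum((observed - E)^2 / E)                # ~0.794, the uncorrected chi-square value
chisq.test(observed, correct = FALSE)    # reports the same uncorrected value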
To plot the critical region for the chi-square test (illustrated with df = 3; the 2x2 example above has df = 1, whose density is unbounded near zero and awkward to plot):
# Define the chi-square critical value
alpha <- 0.05
cv <- qchisq(1 - alpha, df = 3)
# Plot the critical region
x <- seq(0, 12.5, 0.1)
y <- dchisq(x, df = 3)
cr <- subset(data.frame(x, y), x > cv)
plot(x, y, type = "l", lwd = 2, col = "blue", xlab = "Chi-Square Statistic", ylab = "Density",
main = "Critical Region for Chi-Square Test")
abline(h=0)
polygon(c(cr$x, rev(cr$x)), c(cr$y, rep(0, nrow(cr))), col = "red")
Random Sampling: The data should be obtained through random sampling.
Independence: The observations should be independent.
Normality: The population or the sampling distribution should be approximately normally distributed.
Standard Deviations: The population standard deviations of both groups should be known.
R tests for assumptions:
Normality: QQ plot, Chi-square, Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov (non-parametric) test, etc.
The test statistic for the two-sample Z-test is calculated as:
\[ Z = \frac{{\bar{X}_1 - \bar{X}_2}}{{\sqrt{\frac{{\sigma_1^2}}{{n_1}} + \frac{{\sigma_2^2}}{{n_2}}}}} \]
Where:
\(\bar{X}_1\) and \(\bar{X}_2\) are the sample means of the two groups.
\(\sigma_1^2\) and \(\sigma_2^2\) are the population variances of the two groups.
\(n_1\) and \(n_2\) are the sample sizes of the two groups.
The two-sample Z-test is used to compare two population means when the population standard deviations are known. The null and alternative hypotheses are:
\[ H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \neq \mu_2 \]
To perform a two-sample Z-test in R:
# Sample data
group1 <- c(28.5, 29.8, 30.2, 31.0, 29.3)
group2 <- c(30.8, 29.9, 28.7, 31.5, 30.2)
# Two-sample Z-test (sigma.x and sigma.y are the assumed known population SDs)
z_test_2sample_result <- z.test(group1, group2, sigma.x = 0.75, sigma.y = 0.75)
z_test_2sample_result
##
## Two-sample z-Test
##
## data: group1 and group2
## z = -0.96977, p-value = 0.3322
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3896925 0.4696925
## sample estimates:
## mean of x mean of y
## 29.76 30.22
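Again, the statistic can be reproduced from the formula above:
# Reproduce the two-sample z statistic by hand
se <- sqrt(0.75^2 / length(group1) + 0.75^2 / length(group2))
(mean(group1) - mean(group2)) / se   # -0.96977, matching z.test above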
The confidence interval for the two-sample Z-test is calculated as:
\[ \left(\bar{X}_1 - \bar{X}_2 - Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}, \bar{X}_1 - \bar{X}_2 + Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) \]
Where:
\(\bar{X}_1\) and \(\bar{X}_2\) are the sample means of the two groups.
\(Z_{\alpha/2}\) is the critical value from the standard normal distribution corresponding to the significance level \(\alpha\) (confidence level \(1 - \alpha\)).
\(\sigma_1^2\) and \(\sigma_2^2\) are the population variances of the two groups.
\(n_1\) and \(n_2\) are the sample sizes of the two groups.
The confidence interval for the sample data is:
# Confidence interval for two-sample Z-test
z_test_2sample_result$conf.int
## [1] -1.3896925 0.4696925
## attr(,"conf.level")
## [1] 0.95
To plot the critical region for the two-sample Z-test:
# Plot the critical region
x <- seq(26.5, 33.5, 0.1)
y1 <- dnorm(x, mean = mean(group1), sd = sd(group1))
y2 <- dnorm(x, mean = mean(group2), sd = sd(group2))
cr_left1 <- subset(data.frame(x, y1), x < qnorm(0.025, mean = mean(group1), sd = sd(group1)))
cr_right1 <- subset(data.frame(x, y1), x > qnorm(0.975, mean = mean(group1), sd = sd(group1)))
cr_left2 <- subset(data.frame(x, y2), x < qnorm(0.025, mean = mean(group2), sd = sd(group2)))
cr_right2 <- subset(data.frame(x, y2), x > qnorm(0.975, mean = mean(group2), sd = sd(group2)))
plot(x, y1, type = "l", lwd = 2, col = "steelblue", xlab = "Sample Mean", ylab = "Density",
main = "Critical Region for Two-Sample Z-Test")
lines(x, y2, lwd = 2, col = "maroon")
abline(h=0)
polygon(c(cr_left1$x, rev(cr_left1$x)), c(cr_left1$y1, rep(0, length(cr_left1$y1))), col = rgb(0, 0, 1, alpha = 0.4), border = "navy")
polygon(c(cr_right1$x, rev(cr_right1$x)), c(cr_right1$y1, rep(0, length(cr_right1$y1))), col = rgb(0, 0, 1, alpha = 0.4), border = "navy")
polygon(c(cr_left2$x, rev(cr_left2$x)), c(cr_left2$y2, rep(0, length(cr_left2$y2))), col = rgb(1, 0, 0, alpha = 0.4), border = "darkred")
polygon(c(cr_right2$x, rev(cr_right2$x)), c(cr_right2$y2, rep(0, length(cr_right2$y2))), col = rgb(1, 0, 0, alpha = 0.4), border = "darkred")
legend("topright", legend = c("Group1", "Group2"), col = c("steelblue", "maroon"), lty = 1, lwd = 2)
Random Sampling: The data should be obtained through random sampling.
Independence: The observations should be independent.
Normality: The population or the sampling distribution should be approximately normally distributed.
Equal Variances: The variances of the two groups should be approximately equal.
R tests for assumptions:
Normality: QQ plot, Chi-square, Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov (non-parametric) test, etc.
Equal Variances: F, Bartlett, Levene, Fligner-Killeen (non-parametric) test, etc.
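A minimal sketch of these checks for the two groups used below (the group definitions are repeated so the snippet is self-contained):
# Normality per group and equality of variances
group1 <- c(28.5, 29.8, 30.2, 31.0, 29.3)
group2 <- c(30.8, 29.9, 28.7, 31.5, 30.2)
shapiro.test(group1); shapiro.test(group2)   # Shapiro-Wilk normality tests
var.test(group1, group2)                     # F test for equality of variances
bartlett.test(list(group1, group2))          # Bartlett's test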
The test statistic for the two-sample t-test is calculated as:
\[ t = \frac{{\bar{X}_1 - \bar{X}_2}}{{\sqrt{\frac{{s_1^2}}{{n_1}} + \frac{{s_2^2}}{{n_2}}}}} \]
Where:
\(\bar{X}_1\) and \(\bar{X}_2\) are the sample means of the two groups.
\(s_1\) and \(s_2\) are the sample standard deviations of the two groups.
\(n_1\) and \(n_2\) are the sample sizes of the two groups.
The two-sample t-test is used when the population standard deviations are unknown. The null and alternative hypotheses are the same as for the two-sample Z-test. To perform a two-sample t-test in R:
# Sample data
group1 <- c(28.5, 29.8, 30.2, 31.0, 29.3)
group2 <- c(30.8, 29.9, 28.7, 31.5, 30.2)
# Two-sample t-test
t_test_2sample_result <- t.test(group1, group2)
t_test_2sample_result
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = -0.73099, df = 7.9076, p-value = 0.4859
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.9140904 0.9940904
## sample estimates:
## mean of x mean of y
## 29.76 30.22
The confidence interval for the two-sample t-test is calculated as:
\[ \left(\bar{X}_1 - \bar{X}_2 - t_{\alpha/2, \text{df}} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}, \bar{X}_1 - \bar{X}_2 + t_{\alpha/2, \text{df}} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right) \]
Where:
\(\bar{X}_1\) and \(\bar{X}_2\) are the sample means of the two groups.
\(t_{\alpha/2, \text{df}}\) is the critical value from the t-distribution with \(\text{df}\) degrees of freedom corresponding to the significance level \(\alpha\) (confidence level \(1 - \alpha\)).
\(s_1\) and \(s_2\) are the sample standard deviations of the two groups.
\(n_1\) and \(n_2\) are the sample sizes of the two groups.
The confidence interval for the sample data is:
# Confidence interval for two-sample t-test
t_test_2sample_result$conf.int
## [1] -1.9140904 0.9940904
## attr(,"conf.level")
## [1] 0.95
To plot the critical region for the two-sample t-test:
# Plot the critical region
x <- seq(-6, 6, 0.1)
y1 <- dt(x, df = 4)
y2 <- dt(x, df = 4)
cr_left1 <- subset(data.frame(x, y1), x < qt(0.025, df = 4))
cr_right1 <- subset(data.frame(x, y1), x > qt(0.975, df = 4))
cr_left2 <- subset(data.frame(x, y2), x < qt(0.025, df = 4))
cr_right2 <- subset(data.frame(x, y2), x > qt(0.975, df = 4))
plot(x, y1, type = "l", lwd = 2, col = "steelblue", xlab = "t Statistic", ylab = "Density",
main = "Critical Region for Two-Sample T-Test")
lines(x, y2, lwd = 2, col = "maroon")
abline(h=0)
polygon(c(cr_left1$x, rev(cr_left1$x)), c(cr_left1$y1, rep(0, length(cr_left1$y1))), col = rgb(0, 0, 1, alpha = 0.4), border = "navy")
polygon(c(cr_right1$x, rev(cr_right1$x)), c(cr_right1$y1, rep(0, length(cr_right1$y1))), col = rgb(0, 0, 1, alpha = 0.4), border = "navy")
polygon(c(cr_left2$x, rev(cr_left2$x)), c(cr_left2$y2, rep(0, length(cr_left2$y2))), col = rgb(1, 0, 0, alpha = 0.4), border = "darkred")
polygon(c(cr_right2$x, rev(cr_right2$x)), c(cr_right2$y2, rep(0, length(cr_right2$y2))), col = rgb(1, 0, 0, alpha = 0.4), border = "darkred")
legend("topright", legend = c("Group1", "Group2"), col = c("steelblue", "maroon"), lty = 1, lwd = 2)
NOTE
- The plot may appear to show only y2, but y1 is also present; the two curves coincide because y1 and y2 take identical values.
- That is also why you see neither pure steelblue nor pure maroon in the plot, but rather their blend.
- To confirm the code is working correctly, you can set the lwd parameter to different values for the two curves.
- Also note that this test assumes the population variances are unequal (Welch's t-test, R's default). If they are equal, simply set var.equal = TRUE inside the t.test function.
Random Sampling: The data should be obtained through random sampling.
Independence: The observations from the two populations should be independent.
Normality: The paired differences should follow an approximately normal distribution.
R tests for assumptions:
Normality (of the differences): QQ plot, Chi-square, Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov (non-parametric) test, etc.
The test statistic for a paired t-test is calculated as:
\[ t = \frac{{\bar{d}}}{{s_d / \sqrt{n}}} \]
Where:
\(\bar{d}\) is the sample mean of the paired differences.
\(s_d\) is the sample standard deviation of the paired differences.
\(n\) is the number of pairs.
The null and alternative hypotheses for a paired t-test, stated in terms of the mean difference \(\mu_d\), are:
\[ H_0: \mu_d = 0 \quad \text{vs.} \quad H_1: \mu_d \neq 0 \]
To perform a paired t-test in R:
# Sample data
test1 <- c(28.5, 29.8, 30.2, 31.0, 29.3)
test2 <- c(30.8, 29.9, 28.7, 31.5, 30.2)
# Perform a paired t-test
paired_t_test_result <- t.test(test1, test2, paired = TRUE)
print(paired_t_test_result)
##
## Paired t-test
##
## data: test1 and test2
## t = -0.74859, df = 4, p-value = 0.4957
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -2.166102 1.246102
## sample estimates:
## mean difference
## -0.46
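The statistic can be reproduced from the paired differences:
# Reproduce the paired t statistic by hand
d <- test1 - test2
mean(d) / (sd(d) / sqrt(length(d)))   # -0.74859, matching the output above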
The confidence interval for the paired t-test is given by:
\[ \left(\bar{d} - t_{\alpha/2, n-1} \frac{s_d}{\sqrt{n}}, \bar{d} + t_{\alpha/2, n-1} \frac{s_d}{\sqrt{n}}\right) \]
Where:
\(\bar{d}\) is the sample mean of the paired differences.
\(t_{\alpha/2, n-1}\) is the critical value from the t-distribution with \(n-1\) degrees of freedom corresponding to the significance level \(\alpha\) (confidence level \(1 - \alpha\)).
\(s_d\) is the sample standard deviation of the paired differences.
\(n\) is the number of pairs.
The confidence interval for the sample data is:
# Confidence interval for paired t-test
paired_t_test_result$conf.int
## [1] -2.166102 1.246102
## attr(,"conf.level")
## [1] 0.95
To plot the critical region for the paired t-test:
# Plot the critical region
x <- seq(-5.5, 5.5, 0.1)
y <- dt(x, 4)
cr_left <- subset(data.frame(x, y), x < qt(0.025, df = 4))
cr_right <- subset(data.frame(x, y), x > qt(0.975, df = 4))
plot(x, y, type = "l", lwd = 2, col = "black", xlab = "t Statistic", ylab = "Density",
xlim = c(-5.5, 5.5), ylim = c(0, 0.4), main = "Critical Region for Paired T-Test")
abline(h=0)
polygon(c(cr_left$x, rev(cr_left$x)), c(cr_left$y, rep(0, length(cr_left$y))), col = rgb(0, 0, 1, alpha = 0.5), border = "navy")
polygon(c(cr_right$x, rev(cr_right$x)), c(cr_right$y, rep(0, length(cr_right$y))), col = rgb(0, 0, 1, alpha = 0.5), border = "navy")
Random Sampling: The data should be obtained through random sampling.
Independence: The observations should be independent.
Normality: The population or the sampling distribution should be approximately normally distributed.
Equal Variances: The variances of the two groups should be approximately equal.
R tests for assumptions:
Normality: QQ plot, Chi-square, Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov (non-parametric) test, etc.
Equal Variances: F, Bartlett, Levene, Fligner-Killeen (non-parametric) test, etc.
The test statistic for the F-test comparing two variances is calculated as:
\[ F = \frac{{s_1^2}}{{s_2^2}} \]
Where:
\(s_1^2\) is the sample variance of the first group.
\(s_2^2\) is the sample variance of the second group.
The F-test is used to compare the variances of two populations. The null and alternative hypotheses are:
\[ H_0: \sigma_1^2 = \sigma_2^2 \quad \text{vs.} \quad H_1: \sigma_1^2 \neq \sigma_2^2 \]
To perform an F-test for two variances in R (the example below uses the one-sided alternative \(H_1: \sigma_1^2 > \sigma_2^2\) via alternative = "greater"):
# Sample data
group1 <- c(28.5, 29.8, 30.2, 31.0, 29.3, 27.9, 30.4)
group2 <- c(30.8, 29.9, 28.7, 31.5, 30.2, 28.1, 29.3)
# F-test
f_test_result <- var.test(group1, group2, alternative = "greater")
f_test_result
##
## F test to compare two variances
##
## data: group1 and group2
## F = 0.85491, num df = 6, denom df = 6, p-value = 0.573
## alternative hypothesis: true ratio of variances is greater than 1
## 95 percent confidence interval:
## 0.1995651 Inf
## sample estimates:
## ratio of variances
## 0.85491
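The statistic is simply the ratio of the two sample variances:
# Reproduce the F statistic by hand
var(group1) / var(group2)   # 0.85491, matching var.test above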
To plot the critical region for the F-test:
# Define the F critical values
cv_left <- qf(0.025, df1 = 6, df2 = 6)
cv_right <- qf(0.975, df1 = 6, df2 = 6)
# Plot the critical region
x <- seq(0, 7, 0.05)
y <- df(x, df1 = 6, df2 = 6)
cr_left <- subset(data.frame(x, y), x < cv_left)
cr_right <- subset(data.frame(x, y), x > cv_right)
plot(x, y, type = "l", lwd = 2, col = "blue", xlab = "F Statistic", ylab = "Density", main = "Critical Region for F-Test (Two-Sample)")
abline(h=0)
polygon(c(cr_left$x, rev(cr_left$x)), c(cr_left$y, rep(0, length(cr_left$y))), col = "red")
polygon(c(cr_right$x, rev(cr_right$x)), c(cr_right$y, rep(0, length(cr_right$y))), col = "red")
abline(v = c(cv_left, cv_right), lty = 2, col = "black")
legend("topright", legend = c("Critical Region", "F Critical Value"), col = c("red", "black"), lty = c(1, 2), lwd = c(2, 1))
Random Sampling: The data should be obtained through random sampling.
Independence: The observations should be independent.
Normality: The population or the sampling distribution should be approximately normally distributed.
Equal Variances: The variances within each group should be approximately equal.
R tests for assumptions:
Normality: QQ plot, Chi-square, Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov (non-parametric) test, etc.
Equal Variances: F, Bartlett, Levene, Fligner-Killeen (non-parametric) test, etc.
Analysis of Variance (ANOVA) is used to test the equality of means of more than two groups. The null and alternative hypotheses are:
H0: All group means are equal against
H1: At least one group mean is different
To perform an ANOVA test in R:
# Sample data for ANOVA
g1 <- c(25, 30, 35, 40, 45)
g2 <- c(20, 22, 26, 28, 30)
g3 <- c(15, 18, 21, 24, 17)
# ANOVA test
anova_result <- aov(c(g1, g2, g3) ~ rep(c("g1", "g2", "g3"), each = 5))
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## rep(c("g1", "g2", "g3"), each = 5) 2 650.8 325.4 10.59 0.00224 **
## Residuals 12 368.8 30.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
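The same model can be written more readably with an explicit data frame and factor; a sketch equivalent to the aov call above:
# Equivalent, more readable formulation
df_anova <- data.frame(
  value = c(g1, g2, g3),
  group = factor(rep(c("g1", "g2", "g3"), each = 5))
)
summary(aov(value ~ group, data = df_anova))   # same ANOVA table as above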
To plot the critical region for ANOVA:
# Define the F critical value
alpha <- 0.05
cv <- qf(1 - alpha, df1 = 2, df2 = 12)
# Plot the critical region
x <- seq(0, 5, length.out = 100)
y <- df(x, df1 = 2, df2 = 12)
cr <- subset(data.frame(x, y), x > cv)
plot(x, y, type = "l", lwd = 2, col = "blue", xlab = "F Statistic", ylab = "Density", main = "Critical Region for ANOVA")
abline(h=0)
polygon(c(cr$x, rev(cr$x)), c(cr$y, rep(0, nrow(cr))), col = "red")
abline(v = cv, lwd = 1, lty = 2, col = "green")
legend("topright", legend = c("Critical Region", "F-Critical Values"), col = c("red", "green"), lty = c(1, 2), lwd = c(2, 1))
The sign test is performed to check whether the median of a single sample is equal to a specified value.
Random Sampling: The data should be obtained through random sampling.
Independence: The observations should be independent.
The null and alternative hypotheses are:
H0: The median of the sample is equal to a specified value.
H1: The median of the sample is not equal to the specified value.
The sign test compares the number of positive signs (observations above the hypothesized median) with the number of negative signs (observations below it) to assess whether the median equals the specified value; observations equal to it are dropped. R reports the number of positive signs as the test statistic s.
To perform a sign test in R:
data <- c(0.5, 0.7, 0.8, 1.2, 0.6, 0.9, 0.3, 0.5, 0.7, 1.0, 0.8, 0.6, 0.5, 1.1, 1.2)
sign_test <- SIGN.test(data, md = 0.65)
sign_test
##
## One-sample Sign-Test
##
## data: data
## s = 9, p-value = 0.6072
## alternative hypothesis: true median is not equal to 0.65
## 95 percent confidence interval:
## 0.5178168 0.9821832
## sample estimates:
## median of x
## 0.7
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.8815 0.6000 0.9000
## Interpolated CI 0.9500 0.5178 0.9822
## Upper Achieved CI 0.9648 0.5000 1.0000
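In this example the reported s equals the count of observations above the hypothesized median:
# s counts the observations above md (values equal to md are dropped)
sum(data > 0.65)   # 9, matching s above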
The Mann-Whitney U test is used to compare two independent groups to determine whether they have the same median.
Ordinal or Continuous: The variable you’re analyzing should be either ordinal or continuous.
Independence: All of the observations from both groups are independent of each other.
Shape: The shapes of the distributions for the two groups are roughly the same.
The null and alternative hypotheses are:
H0: The two groups have the same median.
H1: The two groups have different medians.
The test is conducted to compare two independent groups and determine whether they have the same distribution. For group 1 with \(n_1\) observations and group 2 with \(n_2\) observations, the statistics \(U_1\) and \(U_2\) are calculated as:
\[ U_1 = n_1 \times n_2 + \frac{n_1 \times (n_1 + 1)}{2} - R_1 \]
\[ U_2 = n_1 \times n_2 + \frac{n_2 \times (n_2 + 1)}{2} - R_2 \]
Where \(R_1\) is the sum of the ranks for group 1 and \(R_2\) is the sum of the ranks for group 2. The test statistic \(U\) is the smaller of \(U_1\) and \(U_2\).
To perform a Mann-Whitney U test in R:
group1 <- c(45, 52, 53, 48, 59, 50, 48, 56, 52, 55, 57, 58, 54, 51, 49)
group2 <- c(60, 55, 57, 61, 58, 62, 59, 63, 64, 65, 61, 60, 62, 67, 59)
mwu_test <- wilcox.test(group1, group2)
mwu_test
##
## Wilcoxon rank sum test with continuity correction
##
## data: group1 and group2
## W = 9.5, p-value = 2.07e-05
## alternative hypothesis: true location shift is not equal to 0
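The reported W can be reproduced from the ranks; R's W equals \(R_1 - n_1(n_1+1)/2\), which is \(n_1 n_2\) minus the \(U_1\) defined above:
# Reproduce W from the combined ranks
r <- rank(c(group1, group2))
R1 <- sum(r[seq_along(group1)])
n1 <- length(group1); n2 <- length(group2)
R1 - n1 * (n1 + 1) / 2               # 9.5, matching W above
n1 * n2 + n1 * (n1 + 1) / 2 - R1     # U1 from the formula above (= n1*n2 - W)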
The Wilcoxon signed-rank test is used to compare two related (paired) groups to determine whether their distributions differ significantly.
Paired Observations: The data should be in the form of pairs of observations on the variable of interest from a single group.
Continuity: The variable of interest should be continuous.
Random Sampling: The data points for each group must be random.
The null and alternative hypotheses are:
H0: The median of the differences between the two groups is zero.
H1: The median of the differences between the two groups is different from zero.
The test statistic (reported as V by R) is calculated as the sum of the ranks of the positive differences between paired observations.
To perform a Wilcoxon Signed-Rank test in R:
before <- c(48, 52, 53, 48, 49, 50, 48, 56, 52, 51, 53, 55, 54, 51, 49)
after <- c(50, 54, 55, 53, 54, 55, 51, 57, 54, 52, 56, 58, 56, 53, 50)
wsr_test <- wilcox.test(before, after, paired = TRUE)
wsr_test
##
## Wilcoxon signed rank test with continuity correction
##
## data: before and after
## V = 0, p-value = 0.0006452
## alternative hypothesis: true location shift is not equal to 0
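The reported V can be reproduced by ranking the absolute differences:
# Reproduce V: sum of the ranks of the positive differences
d <- before - after
r <- rank(abs(d))
sum(r[d > 0])   # 0 here, since every difference is negative; matches V above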
The Kruskal-Wallis test is used to compare three or more independent groups to determine whether they have the same median.
Ordinal or Continuous: The variable you’re analyzing should be either ordinal or continuous.
Independence: All of the observations from each group are independent of each other.
Shape: The shapes of the distributions for the groups are roughly the same.
The null and alternative hypotheses are:
H0: All groups have equal medians.
H1: At least two groups have different medians.
The test is conducted to compare independent groups and determine whether they have the same median. The test statistic \(H\) is calculated as:
\[ H = \frac{12}{N(N + 1)} \sum \frac{R_j^2}{n_j} - 3(N + 1) \]
Where:
\(N\) is the total number of observations.
\(R_j\) is the sum of ranks for group \(j\).
\(n_j\) is the number of observations in group \(j\).
To perform a Kruskal-Wallis test in R:
group1 <- c(45, 52, 53, 48, 59, 50, 48, 56, 52, 55, 57, 58, 54, 51, 49)
group2 <- c(60, 55, 57, 61, 58, 62, 59, 63, 64, 65, 61, 60, 62, 67, 59)
group3 <- c(75, 80, 78, 79, 81, 82, 77, 79, 80, 76, 78, 81, 77, 79, 76)
kw_test <- kruskal.test(list(group1, group2, group3))
kw_test
##
## Kruskal-Wallis rank sum test
##
## data: list(group1, group2, group3)
## Kruskal-Wallis chi-squared = 37.6, df = 2, p-value = 6.843e-09
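The H statistic can be reproduced from the formula above; kruskal.test additionally applies a correction for ties, so this hand computation comes out slightly below the reported value:
# Reproduce H from the rank sums (without the tie correction)
groups <- list(group1, group2, group3)
r <- rank(unlist(groups))
n <- lengths(groups)
N <- sum(n)
Rj <- tapply(r, rep(seq_along(groups), n), sum)   # rank sum per group
12 / (N * (N + 1)) * sum(Rj^2 / n) - 3 * (N + 1)  # slightly below 37.6 due to ties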
The following table summarizes the appropriate statistical tests based on the type of data and the hypothesis being tested.
| Variable Type | Test Description | Statistical Test |
|---|---|---|
| Measure of Center | Compare one population mean to a threshold, population variance known | 1-sample z-test/CI |
| | Compare one population mean to a threshold, population variance unknown | 1-sample t-test/CI |
| | Compare two population means, populations are independent, equal variances | 2-sample t-test/CI (pooled variance) |
| | Compare two population means, populations are independent, unequal variances | 2-sample t-test/CI (unequal variances) |
| | Compare two population means, populations are paired | Paired t-test (1-sample t-test on differences) |
| | Compare three or more population means | ANOVA |
| | Compare one population median to a threshold, continuous distribution | Sign Test |
| | Compare one population median to a threshold, continuous and symmetric distribution | Wilcoxon Signed Rank Test |
| | Compare two population medians, populations are independent | Wilcoxon Rank-Sum Test, Mann-Whitney Statistic |
| | Compare two population medians, populations are paired | Wilcoxon Signed Rank Test/Sign Test (on differences) |
| | Compare three or more population medians | Kruskal-Wallis Test |
| Measure of Spread | Compare one population variance to a threshold | Chi-square test for variance |
| | Compare two population variances | F-test for ratio of two variances |
| | Test equality of two or more population variances | Levene's test, Brown-Forsythe Test |
| Proportion | Compare one population proportion to a threshold | Adjusted (Agresti-Coull) 1-sample proportion test/CI |
| | Compare two population proportions, independent samples | 2-sample proportion test (on difference) |
| | Compare two population proportions, dependent samples | McNemar Test |
| | Compare more than two population proportions | Pearson's chi-square test for homogeneity of proportions |
| Relationships (Continuous) | Continuous response and continuous predictor | Simple linear regression; Pearson's correlation coefficient |
| | Continuous response and categorical predictor | ANOVA |
| | Two or more predictor variables | Multiple linear regression |
| Relationships (Discrete) | Discrete response and continuous predictor | Logistic Regression |
| | Discrete response and categorical predictor | Test of Independence, Pearson's Chi-Square Test |
| Distributions | Test whether data follow a particular probability distribution | Pearson's Chi-Square Goodness of Fit Test |
| | Compare equality of two distributions | Kolmogorov-Smirnov Test |
| | Test for normality | Shapiro-Wilk, Anderson-Darling |