What the two methods have in common is that both compare group means.
The difference is that ANOVA is used when more than two group means are compared, whereas a t-test can compare only two group means.
This raises a question: why do we need ANOVA to compare means when we already have the t-test? The simple reason is that, with t-tests, we would have to run a separate test for every pair of groups: three tests for three groups, and more as the number of groups grows. Let’s see the code below.
Here, the null hypothesis is that all groups have the same mean. The alternative hypothesis is that not all group means are equal.
set.seed(270)
# Step 1. Create Vectors for three groups
A_group <- rnorm(5, mean = 0, sd = 5)
B_group <- rnorm(5, mean = 0, sd = 5)
C_group <- rnorm(5, mean = 0, sd = 5)
# Step 2. Create data frame
com_data <- data.frame(
factor = rep(c("A_group", "B_group", "C_group"), each = 5),
response = c(A_group, B_group, C_group)
)
# Step 3. T-Test
A_B <- t.test(A_group, B_group)
B_C <- t.test(B_group, C_group)
A_C <- t.test(A_group, C_group)
B_C
##
## Welch Two Sample t-test
##
## data: B_group and C_group
## t = -1.0132, df = 4.9682, p-value = 0.3577
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.120394 4.405754
## sample estimates:
## mean of x mean of y
## -0.3847862 2.4725339
# Collect the t-values and p-values of the three t-tests
t_com <- data.frame(
value = c("t_value", "p_value"),
A_B = c(-0.93273, 0.3803),
B_C = c(-1.0132, 0.3577),
A_C = c(-2.7043, 0.03746)
)
# Step 4. ANOVA Test
aov_com <- summary(aov(response ~ factor, data = com_data))
# Step 5. Compare the two tests (for the anova column, the t_value row holds the F-value)
com_df <- data.frame(
value = c("t_value", "p_value"),
A_B = c(-0.93273, 0.3803),
B_C = c(-1.0132, 0.3577),
A_C = c(-2.7043, 0.03746),
anova = c(2.233, 0.15)
)
com_df
## value A_B B_C A_C anova
## 1 t_value -0.93273 -1.0132 -2.70430 2.233
## 2 p_value 0.38030 0.3577 0.03746 0.150
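To make the cost of repeated t-tests concrete, here is a minimal sketch that counts the pairwise tests needed for k groups and estimates how the chance of at least one false positive grows. The variable names are illustrative, and the error-rate formula assumes the tests are independent, which is only an approximation for pairwise comparisons on shared groups.

```r
# Number of pairwise t-tests needed to compare k groups
k <- 3:6
n_tests <- choose(k, 2)

# Chance of at least one false positive when each test uses alpha = 0.05.
# ASSUMPTION: the tests are treated as independent (an approximation).
alpha <- 0.05
familywise_error <- 1 - (1 - alpha)^n_tests

data.frame(groups = k, tests = n_tests,
           familywise_error = round(familywise_error, 3))
```

A single ANOVA tests all group means at once, so the error rate stays at the chosen alpha.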
The F-distribution is a right-skewed distribution used mostly in ANOVA.
Let’s compare the F-distribution and the t-distribution in two graphs.
par(mfrow = c(1,2))
# Create the vector of f-values
f <- seq(from = 0, to = 2, length = 200)
# Evaluate the densities
y_1 <- df(f, 1, 1)
y_2 <- df(f, 3, 1)
y_3 <- df(f, 6, 1)
y_4 <- df(f, 3, 3)
y_5 <- df(f, 6, 3)
y_6 <- df(f, 3, 6)
y_7 <- df(f, 6, 6)
# Plot the densities
plot(f, y_1, col = 1, type = "l", lwd = 2, xlab = "f-value", ylab = "Density", main = "Comparison of F-distributions")
lines(f, y_2, col = 2)
lines(f, y_3, col = 3)
lines(f, y_4, col = 4)
lines(f, y_5, col = 5)
lines(f, y_6, col = 6)
lines(f, y_7, col = 7)
# Add the legend
legend("topright", title = "F_distributions",
c("df = (1,1)", "df = (3,1)", "df = (6,1)", "df = (3,3)",
"df = (6,3)", "df = (3,6)", "df = (6,6)"),
col = c(1, 2, 3, 4, 5, 6, 7), lty = 1)
# Generate a vector of 100 values between -4 and 4
t <- seq(-4, 4, length = 100)
# Evaluate the t-distribution densities
t_1 <- dt(t, df = 4)
t_2 <- dt(t, df = 6)
t_3 <- dt(t, df = 8)
t_4 <- dt(t, df = 10)
t_5 <- dt(t, df = 12)
# Plot the t-distributions
plot(t, t_1, type = "l", lwd = 2, xlab = "t-value", ylab = "Density",
main = "Comparison of T-distributions", col = "black")
lines(t, t_2, col = "red")
lines(t, t_3, col = "orange")
lines(t, t_4, col = "green")
lines(t, t_5, col = "blue")
# Add a legend
legend("topright", c("df = 4", "df = 6", "df = 8", "df = 10", "df = 12"),
col = c("black", "red", "orange", "green", "blue"),
title = "T-Distribution", lty = 1)
What do you see in both graphs? Both distributions are used to decide, through the p-value, whether the null hypothesis is rejected.
The F-test is one-sided (only large F-values count as evidence), whereas the t-test here is two-sided.
Regarding the degrees of freedom, the F-distribution has two of them (e.g. df(f, df1, df2)). df1 is the numerator degrees of freedom, which relates to the number of groups (number of groups minus 1). df2 is the denominator degrees of freedom, which relates to the total number of observations (total observations minus the number of groups).
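The degrees of freedom and the two kinds of p-value can be sketched as follows. The helper `anova_df` is a hypothetical name introduced only for illustration; the statistics plugged in are the rounded values reported in this post, so the resulting p-values are approximate.

```r
# Degrees of freedom for a one-way ANOVA with k groups and N observations in total
# (anova_df is an illustrative helper, not a base R function)
anova_df <- function(k, N) c(df1 = k - 1, df2 = N - k)

anova_df(k = 3, N = 15)  # three groups of five observations
anova_df(k = 4, N = 80)  # four groups, 80 observations in total

# One-sided (upper-tail) p-value for an F statistic
p_f <- pf(2.233, df1 = 2, df2 = 12, lower.tail = FALSE)

# Two-sided p-value for a t statistic (Welch test of B_group vs C_group)
p_t <- 2 * pt(-abs(-1.0132), df = 4.9682)

c(p_f = p_f, p_t = p_t)
```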
Now, what is different between the two graphs? The F-value is a ratio of variances, and variances are always non-negative, so the F-value cannot be negative. The distribution describes the ratio of the variance between groups to the variance within groups. The t-value, by contrast, is the difference between two sample means divided by the standard error of that difference. It becomes negative simply when the first sample mean is smaller than the second, so the sign depends only on the order of subtraction.
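The link between the two statistics can be checked directly: with exactly two groups and a pooled-variance t-test, the squared t-value equals the F-value of a one-way ANOVA. The sketch below uses simulated data with an arbitrary seed; note that it sets `var.equal = TRUE`, because the default Welch test in `t.test` does not satisfy the identity exactly.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
x <- rnorm(10, mean = 0, sd = 2)
y <- rnorm(10, mean = 1, sd = 2)

# Pooled-variance t-test on the two samples
t_res <- t.test(x, y, var.equal = TRUE)

# One-way ANOVA on the same two groups
two_groups <- data.frame(
  response = c(x, y),
  group = factor(rep(c("x", "y"), each = 10))
)
f_res <- summary(aov(response ~ group, data = two_groups))[[1]]

t_squared <- unname(t_res$statistic)^2
f_value <- f_res[["F value"]][1]

all.equal(t_squared, f_value)                   # squared t-value vs F-value
all.equal(t_res$p.value, f_res[["Pr(>F)"]][1])  # two-sided t p-value vs one-sided F p-value
```

Squaring removes the sign of the t-value, which is why a two-sided t-test corresponds to a one-sided F-test.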
wm <- read.csv("Data Files/working_memory.csv", stringsAsFactors = FALSE)
names(wm) <- c("subject", "condition", "iq")
wm$condition <- as.factor(wm$condition)
levels(wm$condition) <- c("8 days", "12 days", "17 days", "19 days")
## using ggplot
library(ggplot2)
# Basic box plot
wm_p <- ggplot(wm, aes(x = condition, y = iq, fill = condition)) +
geom_boxplot() +
labs(title="Plot of IQ gain per days",x = "Condition (Days)", y = "IQ")
wm_p +
geom_jitter(shape = 16, position=position_jitter(0.2)) +
theme(plot.title = element_text(hjust = 0.5, size = 20))
# Step 1. Define number of subjects in each group (assumes all groups have the same size)
n_per_group <- nrow(subset(wm, condition == "19 days"))
# Step 2. Group means
group_mean <- tapply(wm$iq, wm$condition, mean)
# Step 3. Overall (grand) mean
entire_mean <- mean(wm$iq)
# Step 4. Calculate the between-group sum of squares
between_sum.squares <- n_per_group * sum((group_mean - entire_mean)^2)
# Step 1. Split the data by group, because the within-group deviations are calculated per group.
levels(wm$condition)
## [1] "8 days" "12 days" "17 days" "19 days"
wm_8.days <- subset(wm$iq, wm$condition == "8 days")
wm_12.days <- subset(wm$iq, wm$condition == "12 days")
wm_17.days <- subset(wm$iq, wm$condition == "17 days")
wm_19.days <- subset(wm$iq, wm$condition == "19 days")
# Step 2. Subtract the group mean from each individual value in its group.
wm_8.days_mean_diff <- wm_8.days - group_mean[1]
wm_12.days_mean_diff <- wm_12.days - group_mean[2]
wm_17.days_mean_diff <- wm_17.days - group_mean[3]
wm_19.days_mean_diff <- wm_19.days - group_mean[4]
# Step 3. Put everything back together into one vector
within_squares <- c(wm_8.days_mean_diff, wm_12.days_mean_diff, wm_17.days_mean_diff, wm_19.days_mean_diff)
# Step 4. Calculate the within-group sum of squares using within_squares
within_sum.squares <- sum(within_squares^2)
# Number of groups
between_number <- length(levels(wm$condition))
# Number of subjects within each group
within_number <- length(wm_17.days) # one group as representative; assumes equal group sizes
# Define degrees of freedom (df)
between_number_df <- between_number - 1
within_number_df <- between_number * (within_number - 1)
# Calculate mean squares
MS_between <- between_sum.squares / between_number_df
MS_within <- within_sum.squares / within_number_df
# Calculate the F-ratio
f_ratio <- MS_between / MS_within
f_ratio
## [1] 10.49317
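The manual steps above rely on every group having the same number of subjects. A minimal sketch of the same calculation that also handles unequal group sizes might look like this. The function `oneway_f` and the simulated data frame `sim` are hypothetical and are not part of the working memory data; the result is compared with `aov` as a sanity check.

```r
# One-way ANOVA by hand; weights each group by its own size
# (oneway_f is an illustrative helper, not a base R function)
oneway_f <- function(y, g) {
  g <- factor(g)
  n_j <- tapply(y, g, length)   # observations per group
  mean_j <- tapply(y, g, mean)  # group means
  grand_mean <- mean(y)

  ss_between <- sum(n_j * (mean_j - grand_mean)^2)
  ss_within <- sum((y - ave(y, g))^2)  # ave() returns each observation's group mean

  df_between <- nlevels(g) - 1
  df_within <- length(y) - nlevels(g)

  f <- (ss_between / df_between) / (ss_within / df_within)
  p <- pf(f, df_between, df_within, lower.tail = FALSE)
  c(F = f, df1 = df_between, df2 = df_within, p = p)
}

# Simulated, unbalanced data (hypothetical values)
set.seed(42)
sim <- data.frame(
  score = c(rnorm(8, 10, 2), rnorm(12, 12, 2), rnorm(10, 11, 2)),
  group = rep(c("a", "b", "c"), times = c(8, 12, 10))
)

by_hand <- oneway_f(sim$score, sim$group)
by_aov <- summary(aov(score ~ group, data = sim))[[1]]
all.equal(unname(by_hand["F"]), by_aov[["F value"]][1])
```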
# use wm above.
# Apply the aov function
anova_wm <- aov(wm$iq ~ wm$condition)
# Look at the summary table of the result
summary(anova_wm)
## Df Sum Sq Mean Sq F value Pr(>F)
## wm$condition 3 196.1 65.36 10.49 7.47e-06 ***
## Residuals 76 473.4 6.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
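Rather than copying numbers from the printed table, the values can be read from the summary object. The sketch below uses a simulated stand-in data frame called `demo` (hypothetical values, not the working memory data) with the same layout of four conditions. The last line computes, from the sums of squares reported in the table above, the share of the total variation attributed to condition, which is one rough way to judge the size of the effect.

```r
# Simulated stand-in data with four conditions of 20 subjects each (hypothetical)
set.seed(7)
days <- c("8 days", "12 days", "17 days", "19 days")
demo <- data.frame(
  iq = c(rnorm(20, 10, 2.5), rnorm(20, 11, 2.5), rnorm(20, 13, 2.5), rnorm(20, 14, 2.5)),
  condition = factor(rep(days, each = 20), levels = days)
)

fit <- aov(iq ~ condition, data = demo)
tab <- summary(fit)[[1]]  # the ANOVA table as a data frame

f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]
dfs <- tab[["Df"]]        # between-group and residual degrees of freedom

# Share of total variation explained, using the sums of squares reported above
ss_share <- 196.1 / (196.1 + 473.4)
```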
library(car)
# Levene's test
leveneTest(wm$iq, wm$condition)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.1405 0.9355
## 76
# Levene's test with center = mean
leveneTest(wm$iq, wm$condition, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 3 0.1598 0.923
## 76
ANOVA test: is the F-value significant? Yes, it is. The F-value is 10.49, which is large, and the p-value (7.47e-06) is very small. In other words, the between-group variance is 10.49 times the within-group variance, which points to a clear effect of condition. However, we still have to check whether the samples have equal variances.
Levene’s test: the p-value is large (0.9355 with center = median, 0.923 with center = mean), so you cannot reject the null hypothesis that the population variances are equal. When the p-value is higher than 0.05, there is no evidence against equal variances, so the basic ANOVA assumption that all groups have equal variance is treated as satisfied.
Levene’s test (Levene 1960) is used to test whether k samples have equal variances. Equal variance across samples is called homogeneity of variance.
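To see what the test does, it can be written as a one-way ANOVA on the absolute deviations of each observation from its group center. The following is a minimal base R sketch under that definition; `levene_by_hand` is a hypothetical helper, and the data are simulated so that the third group has a larger spread.

```r
# Levene-type test: ANOVA on absolute deviations from the group center
# (levene_by_hand is an illustrative helper, not a function from the car package)
levene_by_hand <- function(y, g, center = median) {
  g <- factor(g)
  # Absolute deviation of each observation from its own group's center
  abs_dev <- abs(y - ave(y, g, FUN = center))
  # One-way ANOVA on those deviations
  tab <- summary(aov(abs_dev ~ g))[[1]]
  c(F = tab[["F value"]][1], p = tab[["Pr(>F)"]][1])
}

# Simulated data (hypothetical): groups a and b have sd = 1, group c has sd = 4
set.seed(123)
sim_y <- c(rnorm(20, 0, 1), rnorm(20, 0, 1), rnorm(20, 0, 4))
sim_g <- rep(c("a", "b", "c"), each = 20)

levene_by_hand(sim_y, sim_g)                 # center = median
levene_by_hand(sim_y, sim_g, center = mean)  # center = mean
```

Because only deviations from the group centers are used, shifting all values by a constant leaves the result unchanged; the test responds to spread, not to location.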