What if there are more than two groups? Consider the model
\[ y_{ij} = \mu + \alpha_i + e_{ij} \]
where
\[ \sum_{i=1}^{g} \alpha_i = 0 \]
and
\[ e_{ij} \sim N(0,\sigma^{2}). \]
Here,
Then the total sample size is
\[ N = \sum_{i = 1}^{g} n_i \]
Observations from group \(i\) follow a normal distribution
\[ y_{ij} \sim N(\mu_i,\sigma^{2}) \]
where the mean of each group is given by
\[ \mu_i = \mu + \alpha_i. \]
Here \(\alpha_i\) measures the effect of group \(i\). It is the difference between the overall mean and the mean of group \(i\).
Essentially, the assumptions here are the same as the two-sample case, however now, we simply have more groups.
Much like the two-sample case, we would again like to test if the means of the groups are equal.
\[ H_0: \mu_1 = \mu_2 = \ldots \mu_g \quad \text{vs} \quad H_1: \text{ Not all } \mu_i \text{ are equal.} \]
Notice that the alternative simply indicates that some of the means are not equal, not specifically which are not equal. More on that later.
Alternatively, we could write
\[ H_0: \alpha_1 = \alpha_2 = \ldots = \alpha_g = 0 \quad \text{vs} \quad H_1: \text{ Not all } \alpha_i \text{ are } 0. \]
This test is called Analysis of Variance (ANOVA). ANOVA compares the variation due to specific sources (between groups) with the variation among individuals who should be similar (within groups). In particular, ANOVA tests whether several populations have the same mean by comparing how far apart the sample means are with how much variation there is within the samples. We use variability of means to test for equality of means, thus the use of variance in the name for a test about means.
The sum of squared total is
\[ SST = \sum_{i = 1}^{g} \sum_{j = 1}^{n_i} (y_{ij} - \bar{y})^2 \]
The variation between groups looks at how far the individual sample means are from the overall sample mean.
\[ SSB = \sum_{i = 1}^{g} \sum_{j = 1}^{n_i} (\bar{y}_i - \bar{y})^2 = \sum_{i = 1}^{g} n_i (\bar{y}_i - \bar{y})^2 \]
Lastly, the within group variation measures how far observations are from the sample mean of its group.
\[ SSW = \sum_{i = 1}^{g} \sum_{j = 1}^{n_i} (y_{ij} - \bar{y}_i)^2 = \sum_{i = 1}^{g} (n_i - 1) s_{i}^{2} \]
This could also be thought of as the error sum of squares, where \(y_{ij}\) is an observation and \(\bar{y}_i\) is its fitted (predicted) value from the model.
To develop the test statistic for ANOVA, we place this information into an ANVOA table.
| Source | Sum of Squares | Degrees of Freedom | Mean Square | \(F\) |
|---|---|---|---|---|
| Between | SSB | \(g - 1\) | SSB / DFB | MSB / MSW |
| Within | SSW | \(N - g\) | SSW / DFW | |
| Total | SST | \(N - 1\) |
We reject the null (equal means) when the \(F\) statistic is large. This occurs when the variation between groups is large compared to the variation within groups. Under the null hypothesis, the distribution of the test statistic is \(F\) with degrees of freedom \(g - 1\) and \(N - g\).
# load dataset
DATA <- read_csv("antid.csv")
## Rows: 15 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Group
## dbl (1): wellbeing
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach(DATA)
View(DATA)
knitr::opts_chunk$set(echo = TRUE)
# Do we have any missing values in the dataset in columnwise direction?
missing_column = colSums(is.na(DATA)) # Check the missing values in the dataset column-wise
print(missing_column)
wellbeing Group
0 0
knitr::opts_chunk$set(echo = TRUE)
missing_values = is.na(DATA) # To look for the total sum of all the missing values in the dataset
total_missing = sum(missing_values)
print(total_missing)
[1] 0
mydata=data.frame(wellbeing, Group)
analysis=aov(wellbeing ~ Group, data=mydata)
summary(analysis)
Df Sum Sq Mean Sq F value Pr(>F)
Group 2 20.13 10.067 5.119 0.0247 *
Residuals 12 23.60 1.967
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretations: Your interpretation is such that Group is statistically significant since the p-value is less than 0.05. It has significant effect on wellbeing.
Alternatively, we can use the following commands below:
anova(analysis)
Analysis of Variance Table
Response: wellbeing
Df Sum Sq Mean Sq F value Pr(>F)
Group 2 20.133 10.0667 5.1186 0.02469 *
Residuals 12 23.600 1.9667
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
res = resid(analysis)
print(res)
1 2 3
0.7999999999999956035168 -0.1999999999999982625010 -1.1999999999999981792342
4 5 6
-1.1999999999999992894573 1.8000000000000004884981 1.7999999999999998223643
7 8 9
-1.2000000000000001776357 0.7999999999999999333866 -1.2000000000000001776357
10 11 12
-0.2000000000000000388578 2.0000000000000000000000 -1.0000000000000002220446
13 14 15
-0.0000000000000001110223 -2.0000000000000000000000 1.0000000000000000000000
SSE = deviance(analysis)
print(SSE)
[1] 23.6
DF = analysis$df
print(DF)
[1] 12
yhat = analysis$fitted.values
print(yhat)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2.2 2.2 2.2 2.2 2.2 3.2 3.2 3.2 3.2 3.2 5.0 5.0 5.0 5.0 5.0
AIC(analysis)
[1] 57.3661
BIC(analysis)
[1] 60.1983
boxplot(mydata$wellbeing ~ mydata$Group, pch=19, col=c(2:7), col.lab="purple", main="Box Plot for Checking Normality Assumption", col.main=22, xlab = "Group", ylab = "Rate of Wellbeing")
### Interpretation: The plot shows that normality is not violated. No
outlier is detected and the data is symmentric and behave well.
qqnorm(res, col=3,lwd=1, pch=19, col.main="blue", col.lab="purple")
qqline(res,col=2,lwd=2)
legend(x="topleft", legend=c("Line of Scatter Points","Line of Best Fit"), col=c(3:2),lwd=2,bg="blue3")
### Interpretation: The plot also suggests that normality is not
violated and the data is symmentric and behave well.
sf.test(res)
Shapiro-Francia normality test
data: res
W = 0.93286, p-value = 0.2554
Remember that the null hypothesis is that THE DATA IS NORMAL.
Reject the null hypothesis if the p-value is less than or equal to 0.05.
Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is normal, indicating that it behaves well.
ks.test(res, "pnorm", mean(res), sd(res))
Asymptotic one-sample Kolmogorov-Smirnov test
data: res
D = 0.17941, p-value = 0.7198
alternative hypothesis: two-sided
Remember that the null hypothesis is that THE DATA IS NORMAL.
Reject the null hypothesis if the p-value is less than or equal to 0.05.
Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is normal, indicating that it behaves well.
lillie.test(res)
Lilliefors (Kolmogorov-Smirnov) normality test
data: res
D = 0.17941, p-value = 0.2167
Remember that the null hypothesis is that THE DATA IS NORMAL.
Reject the null hypothesis if the p-value is less than or equal to 0.05.
Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is normal, indicating that it behaves well.
A residual plot is a scatterplot of the residuals against the predicted values. If the residuals are randomly scattered around the horizontal line (zero), then the assumption of homoscedasticity is met. Otherwise, it is not.
plot(analysis$fitted.values, res, pch=19, col=3, xlab = "Predicted values", ylab = "Residuals",
main = "Residual Plot", col.main=22, col.lab=99)
abline(h=0, lwd=2)
In the resulting plot, we want to see that the residuals are randomly scattered around the horizontal line at zero, with no discernible pattern or trend. If the variance of the residuals is roughly constant across the range of predicted values, then the homoscedasticity assumption is met. If the variance of the residuals is not constant, then the heteroscedasticity assumption is violated. From the plot, it is affirmed that the homogeneity of variances is met.
bartlett.test(wellbeing ~ Group)
Bartlett test of homogeneity of variances
data: wellbeing by Group
Bartlett's K-squared = 0.1853, df = 2, p-value = 0.9115
Remember that the null hypothesis is that THE DATA IS HOMOGENEOUS.
Reject the null hypothesis if the p-value is less than or equal to 0.05.
Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is homogeneous.
fligner.test(mydata$wellbeing, mydata$Group)
Fligner-Killeen test of homogeneity of variances
data: mydata$wellbeing and mydata$Group
Fligner-Killeen:med chi-squared = 0.39529, df = 2, p-value = 0.8207
### Interpretation:
Remember that the null hypothesis is that THE DATA IS HOMOGENEOUS.
### Decision Rule:
Reject the null hypothesis if the p-value is less than or equal to 0.05.
### Decision:
Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is homogeneous.
### To Check for Homogeneity of Variances Assumption using Fligner-Killeen's Test
``` r
A=TukeyHSD(analysis, which=c("Group","wellbeing"))
print(A)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = wellbeing ~ Group, data = mydata)
$Group
diff lwr upr p adj
Low-High -1.8 -4.166241 0.5662412 0.1474576
Placebo-High -2.8 -5.166241 -0.4337588 0.0209244
Placebo-Low -1.0 -3.366241 1.3662412 0.5162761