One-Way ANOVA

What if there are more than two groups? Consider the model

\[ y_{ij} = \mu + \alpha_i + e_{ij} \]

where

\[ \sum_{i=1}^{g} \alpha_i = 0 \]

and

\[ e_{ij} \sim N(0,\sigma^{2}). \]

Here,

Then the total sample size is

\[ N = \sum_{i = 1}^{g} n_i \]

Observations from group \(i\) follow a normal distribution

\[ y_{ij} \sim N(\mu_i,\sigma^{2}) \]

where the mean of each group is given by

\[ \mu_i = \mu + \alpha_i. \]

Here \(\alpha_i\) measures the effect of group \(i\). It is the difference between the overall mean and the mean of group \(i\).

Essentially, the assumptions here are the same as the two-sample case, however now, we simply have more groups.

Much like the two-sample case, we would again like to test if the means of the groups are equal.

\[ H_0: \mu_1 = \mu_2 = \ldots \mu_g \quad \text{vs} \quad H_1: \text{ Not all } \mu_i \text{ are equal.} \]

Notice that the alternative simply indicates that some of the means are not equal, not specifically which are not equal. More on that later.

Alternatively, we could write

\[ H_0: \alpha_1 = \alpha_2 = \ldots = \alpha_g = 0 \quad \text{vs} \quad H_1: \text{ Not all } \alpha_i \text{ are } 0. \]

This test is called Analysis of Variance (ANOVA). ANOVA compares the variation due to specific sources (between groups) with the variation among individuals who should be similar (within groups). In particular, ANOVA tests whether several populations have the same mean by comparing how far apart the sample means are with how much variation there is within the samples. We use variability of means to test for equality of means, thus the use of variance in the name for a test about means.

The sum of squared total is

\[ SST = \sum_{i = 1}^{g} \sum_{j = 1}^{n_i} (y_{ij} - \bar{y})^2 \]

The variation between groups looks at how far the individual sample means are from the overall sample mean.

\[ SSB = \sum_{i = 1}^{g} \sum_{j = 1}^{n_i} (\bar{y}_i - \bar{y})^2 = \sum_{i = 1}^{g} n_i (\bar{y}_i - \bar{y})^2 \]

Lastly, the within group variation measures how far observations are from the sample mean of its group.

\[ SSW = \sum_{i = 1}^{g} \sum_{j = 1}^{n_i} (y_{ij} - \bar{y}_i)^2 = \sum_{i = 1}^{g} (n_i - 1) s_{i}^{2} \]

This could also be thought of as the error sum of squares, where \(y_{ij}\) is an observation and \(\bar{y}_i\) is its fitted (predicted) value from the model.

To develop the test statistic for ANOVA, we place this information into an ANVOA table.

Source Sum of Squares Degrees of Freedom Mean Square \(F\)
Between SSB \(g - 1\) SSB / DFB MSB / MSW
Within SSW \(N - g\) SSW / DFW
Total SST \(N - 1\)

We reject the null (equal means) when the \(F\) statistic is large. This occurs when the variation between groups is large compared to the variation within groups. Under the null hypothesis, the distribution of the test statistic is \(F\) with degrees of freedom \(g - 1\) and \(N - g\).

# load dataset
DATA <- read_csv("antid.csv")
## Rows: 15 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Group
## dbl (1): wellbeing
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach(DATA)
View(DATA)

Data Preprocessing

knitr::opts_chunk$set(echo = TRUE)

# Do we have any missing values in the dataset in columnwise direction?
missing_column = colSums(is.na(DATA))    # Check the missing values in the dataset column-wise
print(missing_column)
wellbeing     Group 
        0         0 

If yes to above, how many missing values altogether?

knitr::opts_chunk$set(echo = TRUE)

missing_values = is.na(DATA)   # To look for the total sum of all the missing values in the dataset
total_missing = sum(missing_values)
print(total_missing)   
[1] 0

Performing ANOVA

mydata=data.frame(wellbeing, Group)
analysis=aov(wellbeing ~ Group, data=mydata)
summary(analysis)
            Df Sum Sq Mean Sq F value Pr(>F)  
Group        2  20.13  10.067   5.119 0.0247 *
Residuals   12  23.60   1.967                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretations: Your interpretation is such that Group is statistically significant since the p-value is less than 0.05. It has significant effect on wellbeing.

Alternatively, we can use the following commands below:

ANALYSIS OF VARIANCE (ANOVA)

anova(analysis)
Analysis of Variance Table

Response: wellbeing
          Df Sum Sq Mean Sq F value  Pr(>F)  
Group      2 20.133 10.0667  5.1186 0.02469 *
Residuals 12 23.600  1.9667                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To obtain the residuals, use the command below:

res = resid(analysis)
print(res) 
                        1                         2                         3 
 0.7999999999999956035168 -0.1999999999999982625010 -1.1999999999999981792342 
                        4                         5                         6 
-1.1999999999999992894573  1.8000000000000004884981  1.7999999999999998223643 
                        7                         8                         9 
-1.2000000000000001776357  0.7999999999999999333866 -1.2000000000000001776357 
                       10                        11                        12 
-0.2000000000000000388578  2.0000000000000000000000 -1.0000000000000002220446 
                       13                        14                        15 
-0.0000000000000001110223 -2.0000000000000000000000  1.0000000000000000000000 

To Obtain Sum of Square Error:

SSE = deviance(analysis)
print(SSE)
[1] 23.6

To Obtain Degree of Freedom for Error:

DF = analysis$df
print(DF)
[1] 12

To Obtain the Fitted Values for the Response Variable:

yhat = analysis$fitted.values
print(yhat)
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
2.2 2.2 2.2 2.2 2.2 3.2 3.2 3.2 3.2 3.2 5.0 5.0 5.0 5.0 5.0 

To obtain AIC and BIC:

AIC(analysis)
[1] 57.3661
BIC(analysis) 
[1] 60.1983

NORMALITY ASSUMPTION

To Check for Normality Assumption using Box Plot

boxplot(mydata$wellbeing ~ mydata$Group, pch=19, col=c(2:7), col.lab="purple", main="Box Plot for Checking Normality Assumption", col.main=22, xlab = "Group", ylab = "Rate of Wellbeing")

### Interpretation: The plot shows that normality is not violated. No outlier is detected and the data is symmentric and behave well.

To Check for Normality Assumption using Normal Q-Q Plot

qqnorm(res, col=3,lwd=1, pch=19, col.main="blue", col.lab="purple")
qqline(res,col=2,lwd=2)
legend(x="topleft", legend=c("Line of Scatter Points","Line of Best Fit"), col=c(3:2),lwd=2,bg="blue3")

### Interpretation: The plot also suggests that normality is not violated and the data is symmentric and behave well.

To Check for Normality Assumption using Shapiro-Francia Test

sf.test(res)

    Shapiro-Francia normality test

data:  res
W = 0.93286, p-value = 0.2554

Interpretation:

Remember that the null hypothesis is that THE DATA IS NORMAL.

Decision Rule:

Reject the null hypothesis if the p-value is less than or equal to 0.05.

Decision:

Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is normal, indicating that it behaves well.

To Check for Normality Assumption using Kolmogorov-Smirnov Test

ks.test(res, "pnorm", mean(res), sd(res))

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  res
D = 0.17941, p-value = 0.7198
alternative hypothesis: two-sided

Interpretation:

Remember that the null hypothesis is that THE DATA IS NORMAL.

Decision Rule:

Reject the null hypothesis if the p-value is less than or equal to 0.05.

Decision:

Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is normal, indicating that it behaves well.

To Check for Normality Assumption using Lillie Test Test

lillie.test(res)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  res
D = 0.17941, p-value = 0.2167

Interpretation:

Remember that the null hypothesis is that THE DATA IS NORMAL.

Decision Rule:

Reject the null hypothesis if the p-value is less than or equal to 0.05.

Decision:

Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is normal, indicating that it behaves well.

HOMOGENEITY OF VARIANCES ASSUMPTION

To Check for Homogeneity of Variances Assumption using Residual Plot

A residual plot is a scatterplot of the residuals against the predicted values. If the residuals are randomly scattered around the horizontal line (zero), then the assumption of homoscedasticity is met. Otherwise, it is not.

plot(analysis$fitted.values, res, pch=19, col=3, xlab = "Predicted values", ylab = "Residuals", 
     main = "Residual Plot", col.main=22, col.lab=99)

abline(h=0, lwd=2)

Interpretation:

In the resulting plot, we want to see that the residuals are randomly scattered around the horizontal line at zero, with no discernible pattern or trend. If the variance of the residuals is roughly constant across the range of predicted values, then the homoscedasticity assumption is met. If the variance of the residuals is not constant, then the heteroscedasticity assumption is violated. From the plot, it is affirmed that the homogeneity of variances is met.

To Check for Homogeneity of Variances Assumption using Bartlett’s Test

bartlett.test(wellbeing ~ Group)

    Bartlett test of homogeneity of variances

data:  wellbeing by Group
Bartlett's K-squared = 0.1853, df = 2, p-value = 0.9115

Interpretation:

Remember that the null hypothesis is that THE DATA IS HOMOGENEOUS.

Decision Rule:

Reject the null hypothesis if the p-value is less than or equal to 0.05.

Decision:

Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is homogeneous.

To Check for Homogeneity of Variances Assumption using Fligner-Killeen’s Test

fligner.test(mydata$wellbeing, mydata$Group)

    Fligner-Killeen test of homogeneity of variances

data:  mydata$wellbeing and mydata$Group
Fligner-Killeen:med chi-squared = 0.39529, df = 2, p-value = 0.8207
### Interpretation:
Remember that the null hypothesis is that THE DATA IS HOMOGENEOUS. 

### Decision Rule:
Reject the null hypothesis if the p-value is less than or equal to 0.05.

### Decision:
Since the p-value is greater than 0.05, we do not reject the null hypothesis. Therefore, the data is homogeneous.




### To Check for Homogeneity of Variances Assumption using Fligner-Killeen's Test

``` r
 A=TukeyHSD(analysis, which=c("Group","wellbeing"))
 
 print(A)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = wellbeing ~ Group, data = mydata)

$Group
             diff       lwr        upr     p adj
Low-High     -1.8 -4.166241  0.5662412 0.1474576
Placebo-High -2.8 -5.166241 -0.4337588 0.0209244
Placebo-Low  -1.0 -3.366241  1.3662412 0.5162761