1.0 Overview

Analysis of Variance (ANOVA) is a statistical technique commonly used to study differences between two or more group means. The ANOVA test is centred on the different sources of variation in a variable. ANOVA in R primarily tests whether the group means are equal. This statistical method is an extension of the t-test to situations where the factor variable has more than two groups. The idea behind the ANOVA test is simple: if the average variation between groups is large enough compared to the average variation within groups, then you can conclude that at least one group mean is not equal to the others.

Thus, it’s possible to evaluate whether the differences between the group means are significant by comparing the two variance estimates. This is why the method is called analysis of variance even though the main goal is to compare the group means.

Briefly, the mathematical procedure behind the ANOVA test is as follows:

Compute the within-group variance, also known as residual variance. This tells us how different each participant is from their own group mean.

Compute the variance between group means.

Produce the F-statistic as the ratio of the between-group variance to the within-group variance.
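The procedure above can be sketched by hand in base R. This is a minimal illustration, not the survey analysis itself: it assumes the built-in PlantGrowth dataset as a stand-in for the survey data, and all variable names are chosen for the example.

```r
# Stand-in data (assumption): built-in PlantGrowth, 3 groups of 10 plants
y <- PlantGrowth$weight
g <- PlantGrowth$group

k <- nlevels(g)   # number of groups
n <- length(y)    # total number of observations

# Within-group (residual) variance: deviations from each group's own mean
ss_within <- sum((y - ave(y, g))^2)
ms_within <- ss_within / (n - k)

# Between-group variance: deviations of group means from the grand mean
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)
ss_between  <- sum(group_sizes * (group_means - mean(y))^2)
ms_between  <- ss_between / (k - 1)

# F-statistic and its p-value
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Cross-check against R's built-in one-way ANOVA
f_aov <- summary(aov(weight ~ group, data = PlantGrowth))[[1]][["F value"]][1]
c(by_hand = f_stat, aov = f_aov, p = p_value)
```

The hand-computed value and the one reported by aov() should agree, which is a quick way to confirm that the F-statistic is nothing more than the ratio of the two variance estimates.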

1.1 Assumptions

The ANOVA test makes the following assumptions about the data:

Independence of the observations: Each subject should belong to only one group. There is no relationship between the observations in each group. Having repeated measures for the same participants is not allowed.

No significant outliers: There should not be extreme values in any cell of the design.

Normality: The data for each design cell should be approximately normally distributed.

Homogeneity of variances: The variance of the outcome variable should be equal in every cell of the design.

Before computing the ANOVA test, you need to perform some preliminary tests to check whether the assumptions are met.
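The homogeneity-of-variances assumption can be checked with Levene's test. The sketch below computes the median-centred (Brown-Forsythe) variant by hand in base R, as a one-way ANOVA on absolute deviations from each group's median. It assumes the built-in PlantGrowth data as a stand-in for the survey data; rstatix::levene_test(), from a package this analysis already loads, centres on the median by default and should report the same statistic.

```r
# Stand-in data (assumption): built-in PlantGrowth
y <- PlantGrowth$weight
g <- PlantGrowth$group

# Absolute deviation of every observation from its own group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# Levene's test (median-centred) is a one-way ANOVA on those deviations
lev_table <- summary(aov(abs_dev ~ g))[[1]]
lev_F <- lev_table[["F value"]][1]
lev_p <- lev_table[["Pr(>F)"]][1]

c(F = lev_F, p = lev_p)
```

A p-value above the chosen significance level (commonly 0.05) gives no evidence against equal variances across groups.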

2.0 Anova for the Survey Questions

One-way ANOVA: Here we conduct a one-way ANOVA to see whether there is a significant difference in the responses to the 4 categories of survey questions among the 5 groups of respondents.

The 4 categories of the survey are: Communication, Service Delivery, Facilities and Equipment, and Information Resources.

The 5 respondent groups of the survey are: Undergrads, Postgrads, Faculty, Exchange Students and Others.

2.1 Loading Relevant Packages

The following snippet of code is used to load the relevant packages.

library(plotly)
library(tidyverse)
library(ggstatsplot)
library(ggpubr)
library(rstatix)
library(DT)

2.2 Loading the dataset

The library dataset is imported into R to perform the ANOVA.

d1<-read_csv("Data/Raw data 2018-03-07 SMU LCS data file - KLG.csv")
Categories <- read_csv("data/ItemCategories.csv")
datatable(Categories,rownames = FALSE, class = 'table-bordered',caption = 'Table 1: Questions and Corresponding Service Categories.')

2.2 Data Preparation for Analysis

# Recode the numeric Position codes into the five respondent groups;
# any code outside 1-14 (or a missing code) becomes NA
d1$cat <- case_when(
  d1$Position %in% 1:4   ~ "Undergrads",
  d1$Position == 5       ~ "Exchange",
  d1$Position %in% 6:7   ~ "PostGrads",
  d1$Position %in% 8:12  ~ "Faculty",
  d1$Position %in% 13:14 ~ "Others",
  TRUE                   ~ NA_character_
)

Selecting the Improvement score for Analysis

d2<-  d1 %>% select(ResponseID,cat,starts_with("I"),-ID)
d2<-d2[-c(5),] # Dropping the NA value

Preparing the data for Communication Related Questions

#Communications
dc<-d2 %>% select(ResponseID,cat,I01,I02,I03)
dc$Mean_Comm<-rowMeans(dc[,3:5],na.rm =TRUE)
dcbox<-dc[,c(1,2,6)]
dcbox <- na.omit(dcbox)

Preparing the data for Service Delivery Related Questions

#Service Delivery
ds<-d2 %>% select(ResponseID,cat,I04,I05,I06,I07,I08,I09,I10,I11,I12,I13)
ds$Mean_Serv<-rowMeans(ds[,3:12],na.rm =TRUE)
dsbox<-ds[,c(1,2,13)]
dsbox <- na.omit(dsbox)

Preparing the data for Facilities and Equipment Related Questions

#Facilities and Equipment
dfac<-d2 %>% select(ResponseID,cat,I14,I15,I16,I17,I18,I19,I20)
dfac$Mean_Fac<-rowMeans(dfac[,3:9],na.rm =TRUE)
dfacbox<-dfac[,c(1,2,10)]
dfacbox <- na.omit(dfacbox)

Preparing the data for Information Resources Related Questions

#Information Resources
dinf<-d2 %>% select(ResponseID,cat,I21,I22,I23,I24,I25,I26)
dinf$Mean_Inf<-rowMeans(dinf[,3:8],na.rm =TRUE)
dinfbox<-dinf[,c(1,2,9)]
dinfbox <- na.omit(dinfbox)
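The four preparation steps above follow one pattern: pick a block of item columns, average them per respondent, and drop incomplete rows. A minimal sketch of a helper that captures this pattern is shown below. The function name category_mean and the toy data frame are hypothetical and only serve the illustration; the column names ResponseID, cat and I01 to I03 mirror the ones used above.

```r
# Hypothetical helper: per-respondent mean over a set of item columns
category_mean <- function(data, items, name) {
  out <- data[, c("ResponseID", "cat")]
  # Mean of the available items; a row with no answers at all yields NaN
  out[[name]] <- rowMeans(data[, items, drop = FALSE], na.rm = TRUE)
  # Drop rows without a group label or without any answered item
  out[!is.na(out$cat) & !is.na(out[[name]]), , drop = FALSE]
}

# Toy data (made up for the example, not survey values)
toy <- data.frame(
  ResponseID = 1:4,
  cat = c("Faculty", "Others", NA, "Faculty"),
  I01 = c(1, 2, 3, NA),
  I02 = c(3, NA, 3, NA),
  I03 = c(5, 4, 3, NA)
)

res <- category_mean(toy, c("I01", "I02", "I03"), "Mean_Comm")
res
```

With such a helper, each category would need a single call, for example category_mean(d2, c("I01", "I02", "I03"), "Mean_Comm"), instead of four near-identical blocks that index columns by position.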

2.3 Checking the Assumptions

Summary of the Communication data

df1<-dcbox %>%
  group_by(cat) %>%
  get_summary_stats(Mean_Comm, type = "mean_sd")
df1

Summary of the Service data

df2<-dsbox %>%
  group_by(cat) %>%
  get_summary_stats(Mean_Serv, type = "mean_sd")
df2

Summary of the Facility data

df3<-dfacbox %>%
  group_by(cat) %>%
  get_summary_stats(Mean_Fac, type = "mean_sd")
df3

Summary of the Information data

df4<-dinfbox %>%
  group_by(cat) %>%
  get_summary_stats(Mean_Inf, type = "mean_sd")
df4

Outlier Check of the Communication data

#Identify outliers
out1<-dcbox %>%
  group_by(cat) %>%
  identify_outliers(Mean_Comm)
out1

Outlier Check of the Service data

#Identify outliers
out2<-dsbox %>%
  group_by(cat) %>%
  identify_outliers(Mean_Serv)
out2

Outlier Check of the Facility data

#Identify outliers
out3<-dfacbox %>%
  group_by(cat) %>%
  identify_outliers(Mean_Fac)
out3

Outlier Check of the Information data

#Identify outliers
out4<-dinfbox %>%
  group_by(cat) %>%
  identify_outliers(Mean_Inf)
out4

Normality Check of the data

model1  <- lm(Mean_Comm ~ cat, data = dcbox)
model2  <- lm(Mean_Serv ~ cat, data = dsbox)
model3  <- lm(Mean_Fac ~ cat, data = dfacbox)
model4  <- lm(Mean_Inf ~ cat, data = dinfbox)

n1<- ggqqplot(dcbox, "Mean_Comm", facet.by = "cat",title  ="Test for Normality for Communication")
s1<-shapiro_test(residuals(model1))
n1

n2<- ggqqplot(dsbox, "Mean_Serv", facet.by = "cat")
s2<-shapiro_test(residuals(model2))
n2

n3<- ggqqplot(dfacbox, "Mean_Fac", facet.by = "cat")
s3<-shapiro_test(residuals(model3))
n3

n4<- ggqqplot(dinfbox, "Mean_Inf", facet.by = "cat")
s4<-shapiro_test(residuals(model4))
n4
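The Shapiro-Wilk results above are stored in s1 to s4 but are only useful once inspected. The sketch below shows how such a result might be read, using base R's shapiro.test() on the residuals of a one-way model. It assumes the built-in PlantGrowth data as a stand-in for the survey data.

```r
# Stand-in data (assumption): built-in PlantGrowth
fit <- lm(weight ~ group, data = PlantGrowth)

# Shapiro-Wilk test on the model residuals
sw <- shapiro.test(residuals(fit))
sw$statistic   # W statistic, close to 1 for normal-looking residuals
sw$p.value     # p-value of the test

# A p-value above 0.05 gives no evidence against normality
residuals_look_normal <- sw$p.value > 0.05
residuals_look_normal
```

For large samples the test becomes sensitive to small deviations, so the Q-Q plots remain a useful complement to the p-value.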

2.4 Conducting the ANOVA

res.aov1 <- dcbox %>% anova_test(Mean_Comm ~ cat)

## Coefficient covariances computed by hccm()

res.aov1

res.aov2 <- dsbox %>% anova_test(Mean_Serv ~ cat)

## Coefficient covariances computed by hccm()

res.aov2

res.aov3 <- dfacbox %>% anova_test(Mean_Fac ~ cat)

## Coefficient covariances computed by hccm()

res.aov3

res.aov4 <- dinfbox %>% anova_test(Mean_Inf ~ cat)

## Coefficient covariances computed by hccm()

res.aov4

3.0 Plotting the output of Anova

Anova for Communication Related Questions

set.seed(123)
g1 <- ggstatsplot::ggbetweenstats(
  data = dcbox,
  x = cat,
  y = Mean_Comm,
  mean.plotting = TRUE,
  mean.ci = TRUE,
  pairwise.comparisons = TRUE, 
  notch = FALSE,
  type = "np",
  k=3,
  title = "Differences in mean ratings  \n(Communication)",
  messages = FALSE) 
g1

#Code for interactive plot
g2 <- plotly::ggplotly(g1, tooltip=c("text","x","y"))
g2 <- g2 %>% layout(yaxis= list(title = "Mean Ratings of the respondents",
                                    titlefont=list(family='Arial', size=12),
                                    tickfont=list(family='Arial', size = 13)),
                        xaxis=list(tickfont=list(family='Arial', size = 11))
                        )
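The type = "np" argument in the call above selects ggstatsplot's non-parametric tests. For more than two independent groups this is the Kruskal-Wallis rank-sum test rather than the classical F-test. A minimal base-R sketch of that test follows, assuming the built-in PlantGrowth data as a stand-in for the survey data.

```r
# Stand-in data (assumption): built-in PlantGrowth
kw <- kruskal.test(weight ~ group, data = PlantGrowth)

kw$statistic   # Kruskal-Wallis chi-squared
kw$parameter   # degrees of freedom = number of groups - 1
kw$p.value     # p-value of the rank-based test
```

Because the test works on ranks, it does not rely on the normality assumption checked earlier, which can make it a reasonable choice when that check fails.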

Anova for Service Related Questions

set.seed(123)
g3 <- ggstatsplot::ggbetweenstats(
  data = dsbox,
  x = cat,
  y = Mean_Serv,
  mean.plotting = TRUE,
  mean.ci = TRUE,
  pairwise.comparisons = TRUE, 
  notch = FALSE,
  type = "np",
  k=3,
  title = "Differences in mean ratings  \n(Service)",
  messages = FALSE) 
g3

#Code for interactive plot


g4 <- plotly::ggplotly(g3, tooltip=c("text","x","y"))
g4 <- g4 %>% layout(yaxis= list(title = "Mean Ratings of the respondents",
                                    titlefont=list(family='Arial', size=12),
                                    tickfont=list(family='Arial', size = 13)),
                        xaxis=list(tickfont=list(family='Arial', size = 11))
                        )

Anova for Facility Related Questions

set.seed(123)
g5 <- ggstatsplot::ggbetweenstats(
  data = dfacbox,
  x = cat,
  y = Mean_Fac,
  mean.plotting = TRUE,
  mean.ci = TRUE,
  pairwise.comparisons = TRUE, 
  notch = FALSE,
  type = "np",
  k=3,
  title = "Differences in mean ratings  \n(Facility)",
  messages = FALSE) 
g5

#Code for interactive plot

g6 <- plotly::ggplotly(g5, tooltip=c("text","x","y"))
g6 <- g6 %>% layout(yaxis= list(title = "Mean Ratings of the respondents",
                                    titlefont=list(family='Arial', size=12),
                                    tickfont=list(family='Arial', size = 13)),
                        xaxis=list(tickfont=list(family='Arial', size = 11))
                        )

Anova for Information Related Questions

set.seed(123)
g7 <- ggstatsplot::ggbetweenstats(
  data = dinfbox,
  x = cat,
  y = Mean_Inf,
  mean.plotting = TRUE,
  mean.ci = TRUE,
  pairwise.comparisons = TRUE, 
  notch = FALSE,
  type = "np",
  k=3,
  title = "Differences in mean ratings  \n(Information)",
  messages = FALSE) 
g7

#Code for interactive plot

g8 <- plotly::ggplotly(g7, tooltip=c("text","x","y"))

## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomLabelRepel() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues

## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomSignif() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues

g8 <- g8 %>% layout(yaxis= list(title = "Mean Ratings of the respondents",
                                    titlefont=list(family='Arial', size=12),
                                    tickfont=list(family='Arial', size = 13)),
                        xaxis=list(tickfont=list(family='Arial', size = 11))
                        )

Plotting the interactive outputs of the 4 Categories of Questions

4.0 Conclusion

The output displays the results of all pairwise comparisons among the tested groups (here, 5 groups). You’ll find the actual difference between the means under diff and the adjusted p-value (p adj) for each pairwise comparison. Looking at the above table, the only significant difference to be reported in the present test is between the means of the groups Faculty and Others, as the adjusted p-value for that pair is less than 0.05.
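The column names diff and p adj mentioned above match the output of base R's TukeyHSD(). The code that produced the pairwise table is not shown in this report, so the following is only a sketch of how such a table might be obtained. It assumes the built-in PlantGrowth data as a stand-in for the survey data.

```r
# Stand-in data (assumption): built-in PlantGrowth
fit <- aov(weight ~ group, data = PlantGrowth)

# Tukey's Honest Significant Differences for all pairs of groups
tk <- TukeyHSD(fit)
pairs <- tk$group   # matrix with columns: diff, lwr, upr, p adj
pairs

# Keep only the pairs whose adjusted p-value is below 0.05
significant <- pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]
significant
```

Each row is one pair of groups, so a design with 5 groups would produce 10 rows.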

The F-statistic is used to test whether the data come from significantly different populations, i.e., populations with different means. To compute the F-statistic, you divide the between-group variability by the within-group variability. The between-group variability reflects the differences between the groups within the whole population. The graphs above illustrate the concept of between-group variance: they show the variation between the five groups.
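Written out, the ratio described above takes the following form. The notation is introduced here for illustration: k is the number of groups, n_j the size of group j, N the total number of observations, \bar{y}_j the mean of group j and \bar{y} the grand mean.

```latex
\[
F \;=\; \frac{MS_{\text{between}}}{MS_{\text{within}}}
  \;=\; \frac{\displaystyle \sum_{j=1}^{k} n_j \,(\bar{y}_j - \bar{y})^2 \,/\, (k-1)}
             {\displaystyle \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 \,/\, (N-k)}
\]
```

Under the null hypothesis of equal group means, F follows an F-distribution with k - 1 and N - k degrees of freedom, so a large value of F is evidence that at least one group mean differs.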

Anova for Survey Analysis

Shreyansh Shivam

25-Apr-2020, (updated on 03 May 2020)