Module 4: ANOVA

ANOVA

ANOVA (analysis of variance) is a statistical method for analysing the variance in a study. It’s used to look at variations in the dependent variable’s mean values that are linked to the influence of independent variables. ANOVA is a method for comparing the means of two or more groups.

ANOVA can be divided into two types: one-way ANOVA and two-way ANOVA.

The one-way ANOVA compares the means of the groups of interest to see whether any of them differ significantly from one another. It tests the null hypothesis that all group means are equal.

The mean of a quantitative variable is estimated using a two-way ANOVA based on the levels of two categorical variables.

H0 : μ1 = μ2 = μ3 = … = μk

HA : μi ≠ μj for some i and j

where k = number of groups and μ = group mean. If the one-way ANOVA returns a statistically significant result, we reject the null hypothesis in favour of the alternative hypothesis (HA), which states that at least two group means differ from each other.
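The hypotheses above are tested with an F statistic. As a sketch of the standard one-way formula, using notation that is not in the original text (n_i for the size of group i, N for the total number of observations, \bar{y}_i for the sample mean of group i and \bar{y} for the grand mean):

```latex
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{\displaystyle \sum_{i=1}^{k} n_i \,(\bar{y}_i - \bar{y})^2 \,/\, (k - 1)}
         {\displaystyle \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \,/\, (N - k)}
```

Under H0 this ratio follows an F distribution with k − 1 and N − k degrees of freedom, which is where the Df and Pr(>F) columns of an ANOVA table come from.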

One Way Anova

I have used another set of data for the one-way ANOVA.

head(data)

##      Name Fish Doll Toy Others
## 1    John   29   47  16     25
## 2    Duke   29   46  16     26
## 3   Chris   29   46  16     24
## 4 Charles   28   46  16     27
## 5   Narin   28   46  16     27
## 6   David   28   46  16     21

Let’s enter the above values into R, with one column per person (A to E) and one row per gift category:

data = data.frame("A" = c(29,47,16,25), "B" = c(29,46,16,26), "C" = c(29,46,16,24), "D" = c(28,46,16,25),"E" = c(28,46,16,27),"Gift"=1:4)
data

##    A  B  C  D  E Gift
## 1 29 29 29 28 28    1
## 2 47 46 46 46 46    2
## 3 16 16 16 16 16    3
## 4 25 26 24 25 27    4

I have reshaped this data into long format, with one row per observation. It took me a long time to figure out the right code, but I finally did so here. By the way, I don’t believe I was required to include the gift number, but I did.

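library(tidyr)  # provides pivot_longer() and re-exports the %>% pipe
# Stack columns A to E: column names go to "Doll", cell values go to "Others"
test <-
  data %>% 
  pivot_longer(c('A','B','C','D','E'), names_to = "Doll", values_to = "Others")
test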

## # A tibble: 20 x 3
##     Gift Doll  Others
##    <int> <chr>  <dbl>
##  1     1 A         29
##  2     1 B         29
##  3     1 C         29
##  4     1 D         28
##  5     1 E         28
##  6     2 A         47
##  7     2 B         46
##  8     2 C         46
##  9     2 D         46
## 10     2 E         46
## 11     3 A         16
## 12     3 B         16
## 13     3 C         16
## 14     3 D         16
## 15     3 E         16
## 16     4 A         25
## 17     4 B         26
## 18     4 C         24
## 19     4 D         25
## 20     4 E         27

Compute the one-way ANOVA table (Df, Sum Sq, Mean Sq, F value and Pr(>F)):

fm <- aov(Others ~ Doll, test)
summary(fm)

##             Df Sum Sq Mean Sq F value Pr(>F)
## Doll         4    1.2     0.3   0.002      1
## Residuals   15 2395.7   159.7
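The sums of squares in this table can be reproduced by hand. The following is a minimal base-R sketch, assuming the same five columns A to E as the data frame defined earlier; the variable names (`ss_between`, `ss_within` and so on) are illustrative and not part of the original analysis.

```r
# Rebuild the data frame shown above (columns A to E, one row per gift).
data <- data.frame(
  A = c(29, 47, 16, 25), B = c(29, 46, 16, 26), C = c(29, 46, 16, 24),
  D = c(28, 46, 16, 25), E = c(28, 46, 16, 27)
)

values <- unlist(data)                          # all 20 observations, column by column
group  <- rep(names(data), each = nrow(data))   # group label for each observation

k <- ncol(data)       # number of groups
N <- length(values)   # total number of observations

grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((values - group_means[group])^2)

# Mean squares, F statistic and its upper-tail p-value
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (N - k)
f_value    <- ms_between / ms_within
p_value    <- pf(f_value, k - 1, N - k, lower.tail = FALSE)

print(c(ss_between = ss_between, ss_within = ss_within,
        F = f_value, p = p_value))
```

The between-group sum of squares is tiny compared with the within-group one, which is why the F value is close to zero and the p-value close to 1: the five columns have almost identical means.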


Chi-squared test using one categorical variable

chisq.test(data$A)

## 
##  Chi-squared test for given probabilities
## 
## data:  data$A
## X-squared = 17.393, df = 3, p-value = 0.0005866

Here the p-value (0.0005866) is less than alpha = 0.05, so we reject the null hypothesis that the four categories are equally likely.
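The X-squared value reported by chisq.test() can be checked by hand. This is a small sketch, assuming the four counts in column A and equal expected probabilities (the default when chisq.test() is given a single vector); the variable names are illustrative.

```r
# Observed counts from column A of the data frame
observed <- c(29, 47, 16, 25)

# Under the null hypothesis every category has the same expected count
expected <- rep(sum(observed) / length(observed), length(observed))

# Pearson chi-squared statistic and its degrees of freedom
x_squared <- sum((observed - expected)^2 / expected)
df        <- length(observed) - 1

# Upper-tail p-value from the chi-squared distribution
p_value <- pchisq(x_squared, df, lower.tail = FALSE)

print(c(x_squared = x_squared, df = df, p_value = p_value))
```

The decision rule compares this p-value with the chosen significance level (commonly 0.05), not with 0.5.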

Visualize the data with ggplot

library(ggmosaic)
library(ggplot2)
ggplot(test , aes(x = Doll, y = Others)) +
  geom_boxplot()

Check the homogeneity of variance assumption. The residuals versus fits plot can be used to check the homogeneity of variances.

plot(fm, 1:2)

The plots above show the residuals against the fitted values and the normal Q-Q plot. Some points fall on the reference line, and the rest are close enough to it to continue with our results.
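A formal test can complement the visual check. The sketch below uses Bartlett's test from base R, which is not part of the original analysis and is sensitive to departures from normality, so it should be read alongside the plots rather than instead of them. The data frame is rebuilt so that the example is self-contained, and stack() is used here as a base-R stand-in for the long-format reshaping.

```r
# Rebuild the five columns used in the one-way ANOVA
data <- data.frame(
  A = c(29, 47, 16, 25), B = c(29, 46, 16, 26), C = c(29, 46, 16, 24),
  D = c(28, 46, 16, 25), E = c(28, 46, 16, 27)
)

# stack() gives a long data frame with columns `values` and `ind` (group label)
long <- stack(data)

# Null hypothesis: all groups have the same variance
res <- bartlett.test(values ~ ind, data = long)
print(res)
```

A large p-value here would mean there is no evidence against equal variances across the groups.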

As a further visualization, I have plotted the group means with their standard errors to show the differences between them.

# Mean of column C, drawn as a dashed reference line
y2 <-  mean(data$C, na.rm = TRUE)
ggplot(test , aes(x = Doll, y = Others)) +
  geom_point() + geom_jitter(color = 'grey') +
  stat_summary(fun.data = 'mean_se',color = "magenta") +  # group mean with standard error
  geom_hline(yintercept = y2, color ="blue", linetype = "dashed")

Two Way Anova

I have used the data from the previous assignment. Using two categorical variables, perform a test for independence.

summary(data1)

##    date_GMT           referee          total_goal_count
##  Length:380         Length:380         Min.   :0.000   
##  Class :character   Class :character   1st Qu.:2.000   
##  Mode  :character   Mode  :character   Median :3.000   
##                                        Mean   :2.821   
##                                        3rd Qu.:4.000   
##                                        Max.   :8.000   
##  total_goals_at_half_time  total_minute stadium_name      
##  Min.   :0.000            Min.   :90    Length:380        
##  1st Qu.:0.000            1st Qu.:90    Class :character  
##  Median :1.000            Median :90    Mode  :character  
##  Mean   :1.253            Mean   :90                      
##  3rd Qu.:2.000            3rd Qu.:90                      
##  Max.   :6.000            Max.   :90

Display the data in a table (referee and total minutes).

table(data1$referee, data1$total_minute)

##                  
##                   90
##   Andre Marriner  27
##   Andy Madley      2
##   Anthony Taylor  32
##   Chris Kavanagh  24
##   Craig Pawson    26
##   David Coote     11
##   Graham Scott    17
##   Jonathan Moss   27
##   Kevin Friend    27
##   Lee Mason       19
##   Lee Probert     18
##   Martin Atkinson 29
##   Michael Oliver  30
##   Mike Dean       29
##   Paul Tierney    24
##   Roger East      10
##   Simon Hooper     8
##   Stuart Attwell  20

Chi-squared test using two categorical variables

chisq.test(table(data1$referee, data1$total_minute))

## 
##  Chi-squared test for given probabilities
## 
## data:  table(data1$referee, data1$total_minute)
## X-squared = 59.768, df = 17, p-value = 1.148e-06

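Here the p-value (1.148e-06) is less than alpha = 0.05, so we reject the null hypothesis. Because total_minute takes only one value (90), the table has a single column, and R runs a goodness-of-fit test on the referee match counts rather than a test of independence, as the output header shows.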
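The statistic above can be reproduced from the match counts alone. This is a minimal sketch, assuming the 18 referee counts are copied from the table shown earlier and that every referee is expected to officiate the same number of matches; the variable names are illustrative.

```r
# Matches per referee, copied from the table above (18 referees, 380 matches)
counts <- c(27, 2, 32, 24, 26, 11, 17, 27, 27,
            19, 18, 29, 30, 29, 24, 10, 8, 20)

# Expected count per referee if matches were shared equally
expected <- sum(counts) / length(counts)

# Pearson chi-squared statistic, degrees of freedom and p-value
x_squared <- sum((counts - expected)^2 / expected)
df        <- length(counts) - 1
p_value   <- pchisq(x_squared, df, lower.tail = FALSE)

print(c(x_squared = x_squared, df = df, p_value = p_value))
```

A genuine test of independence would need a second variable with at least two levels, so that the contingency table has more than one column.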

Visualize the data with ggplot.

# Overall mean of total_goal_count, drawn as a dashed reference line
y1 <-  mean(data1$total_goal_count, na.rm = TRUE)
ggplot(data1, aes(x = referee, y = total_minute))+
  geom_jitter(color = 'grey') +
  stat_summary(fun.data = 'mean_se', color = "red") +  # group mean with standard error
  geom_hline(yintercept = y1,  color = "blue",linetype = "dashed")

Time to run the ANOVA

twoWayAnova <- aov(total_goal_count ~ total_goals_at_half_time*date_GMT, data = data1)
summary(twoWayAnova)

##                                    Df Sum Sq Mean Sq F value Pr(>F)    
## total_goals_at_half_time            1  429.0   429.0 255.154 <2e-16 ***
## date_GMT                          211  261.0     1.2   0.736  0.970    
## total_goals_at_half_time:date_GMT  57   96.8     1.7   1.010  0.473    
## Residuals                         110  185.0     1.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

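So here we examined whether total_goals_at_half_time, date_GMT and the interaction between the two have an effect on total_goal_count. Only total_goals_at_half_time is significant (p < 2e-16); date_GMT (p = 0.970) and the interaction (p = 0.473) are not.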
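In the table above, total_goals_at_half_time has 1 degree of freedom because it is numeric, while date_GMT has 211 because aov() turns a character column into a factor with one level per distinct value. The sketch below shows the same mechanics on R's built-in ToothGrowth dataset, used here purely as a stand-in since data1 is not reproduced in this report; converting dose to a factor is what makes it a categorical predictor.

```r
# ToothGrowth ships with R: len (numeric), supp (factor, 2 levels), dose (numeric)
tg <- ToothGrowth

# Treat dose as categorical; left numeric it would use a single degree of freedom
tg$dose <- factor(tg$dose)

# Two-way ANOVA with both main effects and their interaction
fit <- aov(len ~ supp * dose, data = tg)

# summary() returns a list; the first element is the ANOVA table
tab <- summary(fit)[[1]]
print(tab)
```

Each factor contributes (levels − 1) degrees of freedom, and the interaction contributes the product of the two, which is a quick way to check that the variables were coded as intended.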

plot(twoWayAnova, 1:5)

## Warning: not plotting observations with leverage one:
##   2, 5, 7, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 23, 26, 27, 28, 29, 30, 31, 37, 38, 39, 40, 41, 47, 48, 49, 50, 51, 58, 59, 60, 69, 70, 71, 77, 78, 79, 80, 81, 88, 89, 90, 96, 99, 100, 101, 106, 107, 108, 109, 110, 111, 116, 117, 118, 127, 128, 129, 130, 131, 137, 138, 139, 140, 143, 144, 149, 150, 151, 157, 158, 159, 160, 161, 167, 168, 169, 170, 171, 172, 179, 180, 181, 188, 189, 190, 196, 197, 198, 199, 200, 201, 202, 203, 205, 208, 209, 210, 211, 217, 218, 219, 220, 221, 228, 229, 230, 235, 236, 237, 238, 239, 240, 241, 247, 248, 249, 250, 251, 252, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 271, 273, 274, 275, 280, 286, 287, 288, 289, 290, 294, 296, 297, 298, 299, 303, 304, 305, 311, 312, 313, 314, 320, 324, 325, 326, 327, 332, 333, 334, 335, 336, 337, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Pujan

22/4/2021
