Analysis of variance (ANOVA) is a statistical method for comparing the means of two or more groups. It examines how much of the variation in the dependent variable's mean values can be attributed to the influence of the independent variables.
ANOVA can be divided into two types: one-way ANOVA and two-way ANOVA.
The two-way ANOVA estimates how the mean of a quantitative variable changes according to the levels of two categorical variables.
The one-way ANOVA compares the means of the groups of interest to see whether any of them are significantly different from one another. It tests the null hypothesis that all group means are equal against the alternative that at least two of them differ:
$$
\begin{aligned}
H_0 &: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k \\
H_A &: \mu_i \neq \mu_j \text{ for some } i \text{ and } j
\end{aligned}
$$
Here k is the number of groups and μ is a group mean. If the one-way ANOVA returns a statistically significant result, we reject the null hypothesis in favour of the alternative hypothesis (HA), which states that at least two group means differ from each other.
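To make the decision rule concrete, here is a minimal sketch that runs a one-way ANOVA and compares the p-value with a significance level. It uses R's built-in `PlantGrowth` data set and a threshold of `alpha = 0.05`; both are assumptions chosen for illustration and are not part of the data analysed in this post.

```r
# One-way ANOVA on the built-in PlantGrowth data (illustrative only;
# this is not the data set used elsewhere in this post).
fit <- aov(weight ~ group, data = PlantGrowth)

# summary() returns a list; its first element is the ANOVA table.
anova_table <- summary(fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]  # p-value for the 'group' term

alpha <- 0.05  # assumed significance level
if (p_value < alpha) {
  message("Reject H0: at least two group means differ.")
} else {
  message("Fail to reject H0: no evidence that the group means differ.")
}
```

A significant result only says that some means differ; it does not say which ones.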
head(data)
## Name Fish Doll Toy Others
## 1 John 29 47 16 25
## 2 Duke 29 46 16 26
## 3 Chris 29 46 16 24
## 4 Charles 28 46 16 25
## 5 Narin 28 46 16 27
## 6 David 28 46 16 21
Let's enter the above values into R:
data = data.frame("A" = c(29,47,16,25), "B" = c(29,46,16,26), "C" = c(29,46,16,24), "D" = c(28,46,16,25),"E" = c(28,46,16,27),"Gift"=1:4)
data
## A B C D E Gift
## 1 29 29 29 28 28 1
## 2 47 46 46 46 46 2
## 3 16 16 16 16 16 3
## 4 25 26 24 25 27 4
I reshaped this data from wide to long format, so that each row holds a single observation. It took me a long time to figure out the right code, but I finally got it working here. By the way, I don't believe I was required to include the gift number (the Gift column), but I did.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
test <- data %>%
  pivot_longer(c("A", "B", "C", "D", "E"), names_to = "Doll", values_to = "Others")
test
## # A tibble: 20 x 3
## Gift Doll Others
## <int> <chr> <dbl>
## 1 1 A 29
## 2 1 B 29
## 3 1 C 29
## 4 1 D 28
## 5 1 E 28
## 6 2 A 47
## 7 2 B 46
## 8 2 C 46
## 9 2 D 46
## 10 2 E 46
## 11 3 A 16
## 12 3 B 16
## 13 3 C 16
## 14 3 D 16
## 15 3 E 16
## 16 4 A 25
## 17 4 B 26
## 18 4 C 24
## 19 4 D 25
## 20 4 E 27
Fit the one-way ANOVA and compute the degrees of freedom (Df), sum of squares (Sum Sq), mean square (Mean Sq), F value and p-value (Pr(>F)):
fm <- aov(Others ~ Doll, test)
summary(fm)
## Df Sum Sq Mean Sq F value Pr(>F)
## Doll 4 1.2 0.3 0.002 1
## Residuals 15 2395.7 159.7
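The numbers in the table above can be reproduced by hand, which may help show where the F value comes from. The sketch below re-enters the same five columns (A to E) and computes the between-group and within-group sums of squares directly; the variable names are my own and are only illustrative.

```r
# Same values as the `data` frame above, one vector per column A..E.
groups <- list(
  A = c(29, 47, 16, 25),
  B = c(29, 46, 16, 26),
  C = c(29, 46, 16, 24),
  D = c(28, 46, 16, 25),
  E = c(28, 46, 16, 27)
)

values      <- unlist(groups)
grand_mean  <- mean(values)
group_means <- sapply(groups, mean)
group_sizes <- sapply(groups, length)

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group (residual) sum of squares: spread of values around their own group mean.
ss_within <- sum(sapply(groups, function(x) sum((x - mean(x))^2)))

df_between <- length(groups) - 1               # k - 1
df_within  <- length(values) - length(groups)  # N - k

ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_value    <- ms_between / ms_within
p_value    <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(ss_between = ss_between, ss_within = ss_within,
        f_value = f_value, p_value = p_value), 3)
```

The group means are almost identical while the values inside each group vary a lot, so the F ratio is close to zero and the p-value is close to 1.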
Visualize the data with ggplot
library(ggpubr)
## Loading required package: ggplot2
library(ggmosaic)
library(ggplot2)
ggplot(test, aes(x = Doll, y = Others)) +
  geom_boxplot()
Check the model assumptions. The residuals versus fitted plot can be used to check the homogeneity of variances, and the normal Q-Q plot shows whether the residuals are approximately normally distributed.
plot(fm, 1:2)
In the plots above, some points fall on the reference line and the rest lie close to it. All are close enough for us to continue with our results.
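The visual check can be complemented with formal tests. The sketch below is one possible approach, assuming Bartlett's test for equal variances and the Shapiro-Wilk test for normality of the residuals (both from base R's `stats` package). The data frame `long_df` is a hypothetical stand-in that re-enters the same 20 values as the long table above.

```r
# Re-create the long-format data (same values as the `test` tibble above).
long_df <- data.frame(
  Doll = factor(rep(c("A", "B", "C", "D", "E"), each = 4)),
  Others = c(29, 47, 16, 25,
             29, 46, 16, 26,
             29, 46, 16, 24,
             28, 46, 16, 25,
             28, 46, 16, 27)
)

fit <- aov(Others ~ Doll, data = long_df)

# H0: all groups have the same variance.
bart <- bartlett.test(Others ~ Doll, data = long_df)
# H0: the residuals come from a normal distribution.
shap <- shapiro.test(residuals(fit))

c(bartlett_p = bart$p.value, shapiro_p = shap$p.value)
```

A small p-value in either test would suggest that the corresponding assumption is doubtful. Bartlett's test is itself sensitive to non-normal data, so it is best read together with the plots rather than on its own.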
To visualize this further, I plotted each group's mean with its standard error, together with the overall mean as a dashed line.
y1 <- mean(test$Others, na.rm = TRUE)  # overall (grand) mean of all observations
ggplot(test, aes(x = Doll, y = Others)) +
  geom_point() +
  stat_summary(fun.data = "mean_se", color = "magenta") +
  geom_hline(yintercept = y1, color = "blue", linetype = "dashed")
For the two-way ANOVA I use a second data set, data1, which has one row per football match (380 matches in total).
summary(data1)
## date_GMT referee total_goal_count
## Length:380 Length:380 Min. :0.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :3.000
## Mean :2.821
## 3rd Qu.:4.000
## Max. :8.000
## total_goals_at_half_time stadium_name
## Min. :0.000 Length:380
## 1st Qu.:0.000 Class :character
## Median :1.000 Mode :character
## Mean :1.253
## 3rd Qu.:2.000
## Max. :6.000
Cross-tabulate the stadium name against the total goal count:
table(data1$stadium_name,data1$total_goal_count)
##
## 0 1 2 3 4
## Anfield (Liverpool) 1 2 4 3 3
## Cardiff City Stadium (Cardiff (Caerdydd)) 2 2 3 6 0
## Craven Cottage (London) 0 2 7 5 1
## Emirates Stadium (London) 0 1 10 1 3
## Etihad Stadium (Manchester) 0 3 2 4 5
## Goodison Park (Liverpool) 1 3 7 2 5
## John Smith's Stadium (Huddersfield- West Yorkshire) 1 6 4 6 1
## King Power Stadium (Leicester- Leicestershire) 2 3 5 6 2
## London Stadium (London) 1 3 4 2 6
## Molineux Stadium (Wolverhampton- West Midlands) 1 2 9 3 2
## Old Trafford (Manchester) 2 0 4 6 3
## Selhurst Park (London) 3 4 6 2 3
## St. James' Park (Newcastle upon Tyne) 1 3 4 8 1
## St. Mary's Stadium (Southampton- Hampshire) 2 0 4 6 5
## Stamford Bridge (London) 2 1 7 3 3
## The American Express Community Stadium (Falmer- East Sussex) 1 6 4 2 3
## Tottenham Hotspur Stadium (London) 0 2 1 0 2
## Turf Moor (Burnley) 0 2 6 4 6
## Vicarage Road (Watford) 1 2 3 9 1
## Vitality Stadium (Bournemouth- Dorset) 1 3 4 4 5
## Wembley Stadium (London) 0 5 1 2 5
##
## 5 6 7 8
## Anfield (Liverpool) 3 2 1 0
## Cardiff City Stadium (Cardiff (Caerdydd)) 3 3 0 0
## Craven Cottage (London) 1 3 0 0
## Emirates Stadium (London) 2 2 0 0
## Etihad Stadium (Manchester) 2 1 2 0
## Goodison Park (Liverpool) 0 0 0 1
## John Smith's Stadium (Huddersfield- West Yorkshire) 1 0 0 0
## King Power Stadium (Leicester- Leicestershire) 1 0 0 0
## London Stadium (London) 1 1 1 0
## Molineux Stadium (Wolverhampton- West Midlands) 1 0 1 0
## Old Trafford (Manchester) 4 0 0 0
## Selhurst Park (London) 0 0 0 1
## St. James' Park (Newcastle upon Tyne) 2 0 0 0
## St. Mary's Stadium (Southampton- Hampshire) 1 1 0 0
## Stamford Bridge (London) 3 0 0 0
## The American Express Community Stadium (Falmer- East Sussex) 3 0 0 0
## Tottenham Hotspur Stadium (London) 0 0 0 0
## Turf Moor (Burnley) 0 1 0 0
## Vicarage Road (Watford) 3 0 0 0
## Vitality Stadium (Bournemouth- Dorset) 0 2 0 0
## Wembley Stadium (London) 1 0 0 0
Visualize the data with ggplot:
ggplot(data1, aes(x = stadium_name, y = total_goal_count, color = total_goals_at_half_time))+
geom_boxplot()
Time to run the two-way ANOVA:
twoWayAnova <- aov(total_goal_count ~ total_goals_at_half_time*date_GMT, data = data1)
summary(twoWayAnova)
## Df Sum Sq Mean Sq F value Pr(>F)
## total_goals_at_half_time 1 429.0 429.0 255.154 <2e-16 ***
## date_GMT 211 261.0 1.2 0.736 0.970
## total_goals_at_half_time:date_GMT 57 96.8 1.7 1.010 0.473
## Residuals 110 185.0 1.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
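Here we examined whether total_goals_at_half_time, date_GMT and the interaction between the two have an effect on total_goal_count. Only total_goals_at_half_time is statistically significant (p < 2e-16); date_GMT (p = 0.970) and the interaction (p = 0.473) are not.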
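The `*` in the model formula is what adds the interaction term. As a small sketch of that behaviour, the code below fits a two-way ANOVA on R's built-in `ToothGrowth` data (an assumption for illustration; it is not the football data used here) and checks that `a * b` is shorthand for `a + b + a:b`.

```r
# Built-in ToothGrowth data (illustrative only): tooth length by
# supplement type (supp) and dose.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)  # treat dose as a categorical factor

# `supp * dose` expands to both main effects plus their interaction.
fit_star <- aov(len ~ supp * dose, data = tg)
fit_long <- aov(len ~ supp + dose + supp:dose, data = tg)

term_labels <- attr(terms(fit_star), "term.labels")
print(term_labels)

# Both formulas describe the same model, so the fitted values agree.
same_fit <- isTRUE(all.equal(fitted(fit_star), fitted(fit_long)))
print(same_fit)

summary(fit_star)
```

Each row of the summary is read the same way as in the table above: one row per main effect, one for the interaction, and one for the residuals.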
plot(twoWayAnova, 1:5)
## Warning: not plotting observations with leverage one:
## 2, 5, 7, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 23, 26, 27, 28, 29, 30, 31, 37, 38, 39, 40, 41, 47, 48, 49, 50, 51, 58, 59, 60, 69, 70, 71, 77, 78, 79, 80, 81, 88, 89, 90, 96, 99, 100, 101, 106, 107, 108, 109, 110, 111, 116, 117, 118, 127, 128, 129, 130, 131, 137, 138, 139, 140, 143, 144, 149, 150, 151, 157, 158, 159, 160, 161, 167, 168, 169, 170, 171, 172, 179, 180, 181, 188, 189, 190, 196, 197, 198, 199, 200, 201, 202, 203, 205, 208, 209, 210, 211, 217, 218, 219, 220, 221, 228, 229, 230, 235, 236, 237, 238, 239, 240, 241, 247, 248, 249, 250, 251, 252, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 271, 273, 274, 275, 280, 286, 287, 288, 289, 290, 294, 296, 297, 298, 299, 303, 304, 305, 311, 312, 313, 314, 320, 324, 325, 326, 327, 332, 333, 334, 335, 336, 337, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced