Anova 2 Report

Daria Skarbek, 184869

Published on 3.01.2022

Creating dataset

In this report we use randomly generated data. We set the seed to a fixed number (the student id) so that the data stay the same every time the report is run.
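The seed call itself is not shown in the report. A minimal sketch of how a fixed seed makes random draws repeatable, assuming the student id from the header (184869) is the seed value:

```r
# Assumption: the seed is the student id from the report header.
set.seed(184869)
first_draw <- rnorm(5, mean = 50, sd = 20)

# Resetting the seed restarts the random number stream from the same point.
set.seed(184869)
second_draw <- rnorm(5, mean = 50, sd = 20)

# Both draws contain exactly the same numbers.
identical(first_draw, second_draw)
```

Because the seed is reset before each draw, the comparison returns TRUE.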

We create a vector of 100 observations drawn from a normal distribution with mean 50 and standard deviation 20; these are the voter_age values. Values below 18 are raised to 18, as people younger than that cannot vote yet. voter_race holds 100 random draws from five options, ‘white’, ‘Hispanic’, ‘black’, ‘Asian’ and ‘other’, with the weights given in the code. The weights sum to 1.1 rather than 1, but sample() rescales them, so white is the most common group and Asian and other are the least common. voter_gender takes only two values, ‘male’ and ‘female’, each with probability 0.5.

voter_age <- rnorm(100, 50, 20) # Generate ages               
voter_age <- ifelse(voter_age<18,18,voter_age)

voter_race <- sample(c("white", "hispanic", # Generate races
                     "black", "asian", "other"),              
                     prob = c(0.5, 0.25 ,0.15, 0.1, 0.1), 
                     size = 100,
                     replace = TRUE)

voter_gender <- sample(c("male","female"),  # Generate genders
                       size = 100, 
                       prob = c(0.5,0.5),
                       replace = TRUE)
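The weights passed to prob for voter_race add up to 1.1. sample() accepts weights that do not sum to one and rescales them internally, so the effective probabilities can be checked with a short sketch (the variable names here are illustrative, not part of the report):

```r
# Weights exactly as used for voter_race in the report.
weights <- c(white = 0.5, hispanic = 0.25, black = 0.15, asian = 0.1, other = 0.1)

# The raw weights do not sum to one.
sum(weights)

# Dividing by the total gives the probabilities that sample() effectively uses.
effective_prob <- weights / sum(weights)
round(effective_prob, 3)
```

The rescaled probabilities are roughly 0.455, 0.227, 0.136, 0.091 and 0.091, which is in line with the group counts shown in the summary further down.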

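To make the data more diverse and the groups easier to distinguish, we adjust the age of voters depending on their gender as well as their race (Asian). Note that df_age, the column used in the rest of the report, is built from voter_age rather than voter_age2. As a result only the 10-year shift for Asian women enters the analysed data; the 1.5-year gender shift stored in voter_age2 does not.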

voter_age2 <- ifelse(voter_gender=="male", voter_age - 1.5, voter_age + 1.5) 
voter_age2 <- ifelse(voter_age2<18, 18, voter_age2)

df_age <- ifelse((voter_gender=="female") & (voter_race=="asian"), (voter_age + 10), voter_age)
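If both adjustments were meant to reach the final column, one possible variant is to build it from voter_age2. This is only a sketch on four made-up voters, not the code behind the results reported here, and the name df_age_both is hypothetical:

```r
# Toy input (hypothetical values, chosen so each rule is triggered once).
voter_age    <- c(30, 40, 18, 60)
voter_gender <- c("male", "female", "male", "female")
voter_race   <- c("white", "asian", "black", "asian")

# Gender shift: men 1.5 years younger, women 1.5 years older.
voter_age2 <- ifelse(voter_gender == "male", voter_age - 1.5, voter_age + 1.5)

# Keep every voter at or above the voting age of 18.
voter_age2 <- pmax(voter_age2, 18)

# Race shift applied on top of the gender-shifted ages.
df_age_both <- ifelse(voter_gender == "female" & voter_race == "asian",
                      voter_age2 + 10, voter_age2)
df_age_both
```

All outputs shown in this report come from the original code above, so they would change if this variant were used instead.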

Now we create a data frame from all generated values. Using the kableExtra package, we can print the first 10 rows to get a view of the dataset.

df <- data.frame(df_age, voter_gender, voter_race)
df$voter_race <- as.factor(df$voter_race)
df$voter_gender <- as.factor(df$voter_gender)

library(kableExtra)  # kbl(), kable_material_dark(), column_spec(), spec_color(), %>%

kbl(head(df, 10), caption = "First 10 rows from generated random sample", col.names = c("Age", "Gender", "Race")) %>%
  kable_material_dark() %>%
  column_spec(1, background = spec_color(df$df_age[1:10], end = 0.8))

First 10 rows from generated random sample
Age	Gender	Race
70.66128	female	hispanic
65.19432	female	white
47.03359	female	other
57.69309	female	white
61.26358	male	white
63.09253	male	white
46.73995	female	white
61.10563	male	hispanic
43.39356	female	black
23.65638	female	hispanic

Data & Plots

Now we want to take a look at how the data turned out. To do that, we review the summary and create some plots to understand the data better.

kbl(summary(df))

df_age	voter_gender	voter_race
Min. : 18.00	female:53	asian :10
1st Qu.: 43.47	male :47	black :12
Median : 54.48	NA	hispanic:20
Mean : 55.06	NA	other : 9
3rd Qu.: 66.55	NA	white :49
Max. :106.13	NA	NA

library(ggplot2)

# Violin plot of age by gender, with individual observations overlaid
ggplot(aes(x = df_age, y = voter_gender, fill = voter_gender), data = df) +
  geom_violin() +
  geom_jitter(color = 'black') +
  theme_minimal() +
  theme(legend.position = 'none') +
  xlab("Voter age") +
  ylab("") +
  scale_fill_brewer(palette = "Greens") +
  ggtitle("Age distribution of voters grouped by gender")

# Box plot of age by race; age is on the x axis, so it is labelled as age
ggplot(aes(x = df_age, y = voter_race, fill = voter_race), data = df) +
  geom_boxplot() +
  geom_jitter(color = 'black') +
  theme_minimal() +
  theme(legend.position = 'none') +
  xlab("Voter age") +
  ylab("") +
  scale_fill_brewer(palette = "Greens") +
  ggtitle("Age distribution of voters grouped by race")

# Box plot of age by race, split by gender
ggplot(aes(x = df_age, y = voter_race, fill = voter_gender), data = df) +
  geom_boxplot() +
  theme_minimal() +
  ggtitle("Age distribution of voters") +
  xlab("Voter age") +
  ylab("Voter race") +
  scale_fill_brewer(name = "Gender", palette = "Greens")

Analysis of the plots:
- The age distributions of men and women are quite similar. The female distribution appears to extend further into older ages. Since df_age is built from voter_age, the 1.5-year gender shift is not part of the plotted data, so this difference reflects random variation and the 10-year shift for Asian women rather than a general gender effect.
- The age distributions differ more visibly between races. However, we must take into consideration that the sample is relatively small (only 100 observations). From the summary we can see that, for example, only 12 black and 10 Asian voters are in the dataset. Hence, it is not a good idea to jump to conclusions from such a small sample yet.

ANOVA

Two-way ANOVA test hypotheses

1. There is no difference in the means of factor A
2. There is no difference in the means of factor B
3. There is no interaction between factors A and B

The alternative hypothesis for cases 1 and 2 is: the means are not equal.

The alternative hypothesis for case 3 is: there is an interaction between A and B.

TASK : provide introduction about data, statistics, plots; perform anova + decision and post-hoc tests

In this example we may write these null hypotheses in the following form:

H_0A : Factor A = gender of voters has no effect on the age at which people vote.
H_0B : Factor B = race of voters has no effect on the age at which people vote.
H_0AxB : There is no interaction between gender and race of voters that affects the age at which they vote.

Hence, we will calculate the F test statistic three times and decide which of the null hypotheses can be rejected.

Assumptions

To calculate an ANOVA with interactions we must first check some assumptions. The ANOVA test can be performed if:

samples are independent (when talking about independent ANOVA)
samples come from populations that are normally distributed - normality check
homogeneity of variances in groups - check of homoscedasticity

Normality

First, let's look at the normality assumption. We will use the Shapiro-Wilk test. The null hypothesis of the test is H₀ : the sample data come from a normal population.

kbl(mshapiro_test(df$df_age))

statistic	p.value
0.9902157	0.6820391

At the significance level alpha = 0.05 we cannot reject the null hypothesis (p = 0.68). We therefore have no evidence against normality and proceed as if the data come from a normal population, which allows us to continue with the ANOVA.

If we wanted to view the results in a different way, we could have used the following method, which runs the test separately within each group:

kbl(df %>%
  group_by(voter_gender) %>%
  shapiro_test(df_age))

voter_gender	variable	statistic	p
female	df_age	0.9781397	0.4373404
male	df_age	0.9720733	0.3170315

kbl(df %>%
  group_by(voter_race) %>%
  shapiro_test(df_age))

voter_race	variable	statistic	p
asian	df_age	0.9722693	0.9110511
black	df_age	0.8756657	0.0771469
hispanic	df_age	0.9569201	0.4843039
other	df_age	0.9072652	0.2972001
white	df_age	0.9668000	0.1800314

This shows, group by group, where the normality assumption is met and where (if anywhere) it is not. Here every p-value is larger than the 0.05 we chose as our significance level. The smallest one, 0.077 for the black group, is still above it, so normality is not rejected in any group.
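The normality assumption in ANOVA is often checked on the model residuals as well, which is what the Plot2WayANOVA output later in this report does. A minimal base R sketch of that check on simulated data (the demo data frame and its column names are made up for illustration):

```r
# Simulated stand-in data: 60 voters in a balanced 3 x 2 layout.
set.seed(1)
demo <- data.frame(
  age    = rnorm(60, mean = 50, sd = 20),
  race   = factor(rep(c("white", "hispanic", "black"), times = 20)),
  gender = factor(rep(c("male", "female"), each = 30))
)

# Fit the two-way model with interaction and test its residuals.
fit <- aov(age ~ race * gender, data = demo)
res_test <- shapiro.test(residuals(fit))
res_test$p.value
```

A p-value above the chosen alpha would again mean that normality of the residuals is not rejected.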

Homogeneity of variances

To check the homogeneity of variances we will use two tests: Bartlett’s test and Levene’s test. Levene’s test is more robust, as it is less sensitive to departures from normality. For illustration, however, we will use both of them.

bartlett.test(df_age ~ interaction(voter_race, voter_gender), data = df)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  df_age by interaction(voter_race, voter_gender)
## Bartlett's K-squared = 14.524, df = 9, p-value = 0.1049

Bartlett’s test across the groups defined by the two independent variables voter_race and voter_gender returns a p-value of 0.105. This is larger than our alpha of 0.05, so we do not reject homogeneity of variances and treat the assumption as fulfilled.

leveneTest(df_age ~ voter_race * voter_gender, data = df)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  9  1.3847 0.2069
##       90

Levene’s test leads us to the same conclusion (p = 0.207). The assumption of homogeneity of variances is therefore treated as fulfilled.
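Levene's test with center = median (also called the Brown-Forsythe test) is a one-way ANOVA on the absolute deviations of each observation from its group median. A hedged base R sketch of that idea on simulated data, with made-up names y and g:

```r
# Simulated stand-in data: three groups of 20 observations each.
set.seed(2)
demo <- data.frame(
  y = rnorm(60, mean = 50, sd = 20),
  g = factor(rep(c("a", "b", "c"), each = 20))
)

# Absolute deviation of every observation from its own group median.
abs_dev <- abs(demo$y - ave(demo$y, demo$g, FUN = median))

# One-way ANOVA on those deviations gives the Levene F statistic and p-value.
lev <- anova(lm(abs_dev ~ demo$g))
lev
```

In the report the grouping factor is the combination of race and gender, which gives the 9 and 90 degrees of freedom shown in the output above.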

ANOVA Calculations

Since none of the assumption checks was rejected, we can calculate the ANOVA.

Since we want an ANOVA with interaction, we use the * operator in the model formula, which expands to both main effects plus their interaction:

model <- aov(df_age ~ voter_race * voter_gender, data = df)

Now we can pass the model to anova() to obtain the ANOVA table.

As we know, the calculation of ANOVA takes following components:

Sum of Squares for Treatment A (gender), Treatment B (race), Interaction AxB, Error (within)
Mean Sum of Squares for all above
Degrees of Freedom
F value

All these parts can be computed manually. However, thanks to the R function anova() we can also easily see the p-value and quickly assess whether the null hypothesis can or cannot be rejected.

kbl(anova(model))

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
voter_race	4	2130.84480	532.71120	1.5460248	0.1956939
voter_gender	1	18.35712	18.35712	0.0532757	0.8179822
voter_race:voter_gender	4	896.59175	224.14794	0.6505181	0.6279849
Residuals	90	31011.15116	344.56835	NA	NA
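The Mean Sq, F value and Pr(>F) columns follow directly from the sums of squares and degrees of freedom. As a check, they can be recomputed by hand from the numbers in the table (the variable names below are illustrative):

```r
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss  <- c(race = 2130.84480, gender = 18.35712, interaction = 896.59175)
dfs <- c(race = 4, gender = 1, interaction = 4)
ss_resid <- 31011.15116
df_resid <- 90

# Mean square = sum of squares / degrees of freedom.
ms       <- ss / dfs
ms_resid <- ss_resid / df_resid

# F = effect mean square / residual mean square; p-value from the F distribution.
f_value <- ms / ms_resid
p_value <- pf(f_value, dfs, df_resid, lower.tail = FALSE)

round(cbind(ms, f_value, p_value), 4)
```

The recomputed values agree with the table up to rounding of the inputs.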

The ANOVA results state that:
- there is no significant effect of voter race on voter age. The p-value of 0.196 is larger than our significance level of 0.05, hence we cannot reject H_0B.
- there is also no significant effect of voter gender on voter age. The p-value is 0.818, which is much higher than alpha, so we cannot reject H_0A.
- there is also no significant interaction between gender and race (p-value 0.628). We cannot reject H_0AxB.

We did not manage to reject any of the null hypotheses. Hence, this sample provides no evidence that the race or gender of voters influences the age at which they vote.

We can generate an interaction plot to review the connections visually.

interaction.plot(x.factor = df$voter_race, trace.factor = df$voter_gender, 
                 response = df$df_age, fun = mean, 
                 type = "b", legend = TRUE, 
                 xlab = "Race of voters", ylab = "Age of voters",
                 pch = c(1,19), 
                 trace.label = "",
                 col = c("#D83FFF", "#12FEF7"))
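The points drawn by an interaction plot are the mean responses in each combination of the two factors. A small sketch on simulated data shows how the same cell means can be tabulated directly (the demo data frame is made up for illustration):

```r
# Simulated stand-in data: 24 voters, 4 per race-gender cell.
set.seed(5)
demo <- data.frame(
  age    = rnorm(24, mean = 50, sd = 20),
  race   = factor(rep(c("asian", "black", "white"), times = 8)),
  gender = factor(rep(c("female", "male"), each = 12))
)

# Mean age for every race (rows) by gender (columns) combination.
cell_means <- with(demo, tapply(age, list(race, gender), mean))
round(cell_means, 1)
```

Roughly parallel lines in the plot correspond to rows of this table that differ by a similar amount across columns, which is what the absence of an interaction looks like.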

Another way to visualize the ANOVA result is Plot2WayANOVA from the CGPfunctions package. It returns the ANOVA calculations together with an interaction plot. The plot is similar to the one generated above (the scale is different), but it also shows 95% confidence intervals. The function additionally checks homogeneity of variance and normality, and it would perform post-hoc tests if any null hypothesis were rejected.

Plot2WayANOVA(formula = df_age ~ voter_race * voter_gender, dataframe = df)

## 
##              --- WARNING! ---
##      You have an unbalanced design. Using Type II sum of 
##             squares, to calculate factor effect sizes eta and omega.
##             Your two factors account for 0.09 of the type II sum of 
##             squares.

## term                    |     sumsq |  meansq | df | statistic | p.value | etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f | power
## -----------------------------------------------------------------------------------------------------------------------------------------------------------
## voter_race              |  2139.555 | 534.889 |  4 |     1.552 |   0.194 | 0.063 |         0.065 |   0.022 |           0.022 |     0.022 |    0.263 | 0.486
## voter_gender            |    18.357 |  18.357 |  1 |     0.053 |   0.818 | 0.001 |         0.001 |  -0.009 |          -0.010 |    -0.010 |    0.024 | 0.056
## voter_race:voter_gender |   896.592 | 224.148 |  4 |     0.651 |   0.628 | 0.026 |         0.028 |  -0.014 |          -0.014 |    -0.014 |    0.170 | 0.215
## Residuals               | 31011.151 | 344.568 | 90 |           |         |       |               |         |                 |           |          |

## 
## Table of group means

## # A tibble: 10 x 15
## # Groups:   voter_race [5]
##    voter_race voter_gender TheMean TheSD TheSEM CIMuliplier LowerBoundCI
##    <fct>      <fct>          <dbl> <dbl>  <dbl>       <dbl>        <dbl>
##  1 asian      female          60.8 10.8    4.81        2.78         47.4
##  2 asian      male            58.7 22.2    9.91        2.78         31.2
##  3 black      female          54.1 24.2   10.8         2.78         24.1
##  4 black      male            46.9 32.7   12.3         2.45         16.6
##  5 hispanic   female          49.1 18.5    5.13        2.18         37.9
##  6 hispanic   male            57.2 18.0    6.80        2.45         40.5
##  7 other      female          37.7  9.01   4.51        3.18         23.4
##  8 other      male            51.5 26.8   12.0         2.78         18.2
##  9 white      female          59.2 15.3    3.00        2.06         53.0
## 10 white      male            57.6 14.8    3.09        2.07         51.2
## # ... with 8 more variables: UpperBoundCI <dbl>, LowerBoundSEM <dbl>,
## #   UpperBoundSEM <dbl>, LowerBoundSD <dbl>, UpperBoundSD <dbl>, N <int>,
## #   LowerBound <dbl>, UpperBound <dbl>

## 
## Post hoc tests for all effects that were significant

## [1] "No signfiicant effects"

## 
## Testing Homogeneity of Variance with Brown-Forsythe

## Brown-Forsythe Test for Homogeneity of Variance using median
##       Df F value Pr(>F)
## group  9  1.3847 0.2069
##       90

## 
## Testing Normality Assumption with Shapiro-Wilk

## 
##  Shapiro-Wilk normality test
## 
## data:  MyAOV_residuals
## W = 0.9809, p-value = 0.156

## 
## Bayesian analysis of models in order

## # A tibble: 4 x 4
##   model                                 bf support               margin_of_error
##   <chr>                              <dbl> <chr>                           <dbl>
## 1 voter_race                        0.389  " data support is an~     0.000000724
## 2 voter_gender                      0.214  " data support is mo~     0.000253   
## 3 voter_race + voter_gender         0.0836 " data support is st~     0.0253     
## 4 voter_race + voter_gender + vote~ 0.0163 " data support is ve~     0.0141

## 
## Interaction graph plotted...

Post-hoc tests

Since we did not reject any H₀, there is no need to perform a post-hoc test.

However, in this case we will still look at Tukey’s HSD (Honestly Significant Difference) test, as it should only confirm that no pair of groups differs significantly.

post_hoc1 <- TukeyHSD(model)
post_hoc1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = df_age ~ voter_race * voter_gender, data = df)
## 
## $voter_race
##                      diff        lwr       upr     p adj
## black-asian     -9.867689 -31.993785 12.258407 0.7271518
## hispanic-asian  -7.825068 -27.838875 12.188739 0.8120482
## other-asian    -14.363255 -38.106475  9.379966 0.4488988
## white-asian     -1.338277 -19.269588 16.593034 0.9995774
## hispanic-black   2.042620 -16.826578 20.911819 0.9981723
## other-black     -4.495566 -27.282287 18.291156 0.9817727
## white-black      8.529412  -8.114689 25.173512 0.6122884
## other-hispanic  -6.538186 -27.280004 14.203632 0.9044483
## white-hispanic   6.486791  -7.225044 20.198626 0.6814270
## white-other     13.024978  -5.715419 31.765374 0.3067237
## 
## $voter_gender
##                  diff       lwr      upr     p adj
## male-female 0.8494951 -6.539367 8.238357 0.8198478
## 
## $`voter_race:voter_gender`
##                                      diff       lwr      upr     p adj
## black:female-asian:female      -6.6536030 -44.74314 31.43594 0.9999043
## hispanic:female-asian:female  -11.6769627 -43.36937 20.01545 0.9712784
## other:female-asian:female     -23.0697813 -63.46984 17.33027 0.7000385
## white:female-asian:female      -1.6202349 -31.02955 27.78908 1.0000000
## asian:male-asian:female        -2.0754824 -40.16502 36.01406 1.0000000
## black:male-asian:female       -13.9424493 -49.20651 21.32161 0.9550324
## hispanic:male-asian:female     -3.6365251 -38.90059 31.62753 0.9999990
## other:male-asian:female        -9.2659675 -47.35551 28.82357 0.9985831
## white:male-asian:female        -3.2303823 -32.94744 26.48668 0.9999984
## hispanic:female-black:female   -5.0233597 -36.71577 26.66905 0.9999578
## other:female-black:female     -16.4161783 -56.81623 23.98388 0.9468269
## white:female-black:female       5.0333681 -24.37595 34.44268 0.9999194
## asian:male-black:female         4.5781206 -33.51142 42.66766 0.9999961
## black:male-black:female        -7.2888462 -42.55291 27.97521 0.9996150
## hispanic:male-black:female      3.0170780 -32.24698 38.28114 0.9999998
## other:male-black:female        -2.6123645 -40.70190 35.47717 1.0000000
## white:male-black:female         3.4232207 -26.29384 33.14028 0.9999973
## other:female-hispanic:female  -11.3928186 -45.82769 23.04206 0.9860885
## white:female-hispanic:female   10.0567278 -10.40064 30.51409 0.8471477
## asian:male-hispanic:female      9.6014803 -22.09093 41.29389 0.9925563
## black:male-hispanic:female     -2.2654865 -30.49933 25.96836 0.9999999
## hispanic:male-hispanic:female   8.0404377 -20.19341 36.27428 0.9952746
## other:male-hispanic:female      2.4109952 -29.28142 34.10341 0.9999999
## white:male-hispanic:female      8.4465804 -12.45078 29.34394 0.9485098
## white:female-other:female      21.4495464 -10.89639 53.79548 0.4972123
## asian:male-other:female        20.9942989 -19.40576 61.39435 0.8003212
## black:male-other:female         9.1273320 -28.62059 46.87525 0.9986489
## hispanic:male-other:female     19.4332562 -18.31466 57.18118 0.8088557
## other:male-other:female        13.8038138 -26.59624 54.20387 0.9826357
## white:male-other:female        19.8393990 -12.78659 52.46539 0.6198494
## asian:male-white:female        -0.4552475 -29.86456 28.95407 1.0000000
## black:male-white:female       -12.3222143 -37.96688 13.32245 0.8638243
## hispanic:male-white:female     -2.0162901 -27.66095 23.62837 0.9999999
## other:male-white:female        -7.6457326 -37.05505 21.76358 0.9976276
## white:male-white:female        -1.6101474 -18.84959 15.62929 0.9999996
## black:male-asian:male         -11.8669668 -47.13103 23.39709 0.9843555
## hispanic:male-asian:male       -1.5610426 -36.82510 33.70302 1.0000000
## other:male-asian:male          -7.1904851 -45.28002 30.89905 0.9998172
## white:male-asian:male          -1.1548998 -30.87196 28.56216 1.0000000
## hispanic:male-black:male       10.3059242 -21.88561 42.49746 0.9889518
## other:male-black:male           4.6764817 -30.58758 39.94054 0.9999909
## white:male-black:male          10.7120670 -15.28494 36.70908 0.9421647
## other:male-hispanic:male       -5.6294425 -40.89350 29.63462 0.9999552
## white:male-hispanic:male        0.4061428 -25.59087 26.40315 1.0000000
## white:male-other:male           6.0355853 -23.68147 35.75264 0.9996663

None of these adjusted p-values is smaller than 0.05. Hence, no pairwise difference in mean age between races, genders or their combinations is significant.
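Instead of scanning the long table by eye, the adjusted p-values can be pulled out of the TukeyHSD result programmatically. A minimal sketch on simulated one-factor data (the names demo, y and g are made up for illustration):

```r
# Simulated stand-in data: three groups of 20 observations each.
set.seed(4)
demo <- data.frame(
  y = rnorm(60, mean = 50, sd = 20),
  g = factor(rep(c("a", "b", "c"), each = 20))
)

# TukeyHSD returns one matrix per model term, named after the term.
tk <- TukeyHSD(aov(y ~ g, data = demo))
p_adj <- tk$g[, "p adj"]

# Smallest adjusted p-value and whether any comparison is below 0.05.
min(p_adj)
any(p_adj < 0.05)
```

For the model in this report, the same idea would be applied to each element of post_hoc1, for example with sapply(post_hoc1, function(m) min(m[, "p adj"])).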

par(mfrow=c(2,1))
par(mar=c(4.8,3.5,2.8,1.5))
plot(post_hoc1)