Creating dataset
In this report we use randomly generated data. We set the random seed to a fixed number (the student id) so that the data stay the same across all runs.
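A minimal sketch of how a fixed seed makes the draws reproducible. The seed value 12345 is a placeholder assumed here for illustration; the report uses the student id, which is not shown.

```r
# Placeholder seed (assumption): substitute the actual student id.
seed_value <- 12345

set.seed(seed_value)
first_draw <- rnorm(5, mean = 50, sd = 20)

set.seed(seed_value)   # resetting the seed restarts the random stream
second_draw <- rnorm(5, mean = 50, sd = 20)

identical(first_draw, second_draw)  # TRUE
```

Calling set.seed() once at the top of the script, before any rnorm() or sample() call, is enough to fix every later draw.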
We create a vector of 100 observations drawn from a normal distribution with mean 50 and standard deviation 20; these are the voter_age values. We then raise every value below 18 to 18, since people under 18 cannot vote yet. voter_race holds 100 values sampled from the five options ‘white’, ‘hispanic’, ‘black’, ‘asian’ and ‘other’ with the given weights (they sum to 1.1, and sample() rescales them to probabilities). Hence white is the most common race, while Asian and other are the least common. voter_gender takes only the two values ‘male’ and ‘female’, each with probability 0.5.
voter_age <- rnorm(100, 50, 20) # Generate ages
voter_age <- ifelse(voter_age < 18, 18, voter_age) # Raise ages below 18 to 18
voter_race <- sample(c("white", "hispanic", # Generate races
"black", "asian", "other"),
prob = c(0.5, 0.25 ,0.15, 0.1, 0.1),
size = 100,
replace = TRUE)
voter_gender <- sample(c("male","female"), # Generate genders
size = 100,
prob = c(0.5,0.5),
replace = TRUE)
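The weights passed to sample() above add up to 1.1 rather than 1. The sketch below shows the probabilities that result once the weights are rescaled, which is what sample() does with weights that do not sum to one. The variable names are illustrative.

```r
races   <- c("white", "hispanic", "black", "asian", "other")
weights <- c(0.5, 0.25, 0.15, 0.1, 0.1)

sum(weights)                        # the weights total 1.1, not 1

# Rescale the weights so that they sum to one.
implied <- weights / sum(weights)
names(implied) <- races
round(implied, 3)
```

With 100 draws this means about 45 white voters are expected rather than 50, which is in line with the 49 observed in the summary table.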
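To make the data more diverse and the groups easier to distinguish, we alter the age of voters depending on their gender as well as on their race (Asian).
voter_age2 <- ifelse(voter_gender == "male", voter_age - 1.5, voter_age + 1.5) # Shift ages by gender
voter_age2 <- ifelse(voter_age2 < 18, 18, voter_age2) # Keep ages at 18 or above
# Note: df_age is built from voter_age, so the gender shift stored in voter_age2 is not carried into df_age
df_age <- ifelse((voter_gender == "female") & (voter_race == "asian"), voter_age + 10, voter_age) # Add 10 years for Asian women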
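A small sketch of the two conditional rules on toy values, to make the vectorised ifelse() logic easier to follow. The toy vectors are hypothetical and chosen so that each branch is hit once.

```r
# Toy values (hypothetical), not the report's data.
toy_age    <- c(12, 18.5, 40, 40)
toy_gender <- c("male", "male", "female", "female")
toy_race   <- c("white", "black", "asian", "white")

# Floor at 18, same rule as in the report; pmax() is an equivalent shortcut.
floored <- ifelse(toy_age < 18, 18, toy_age)
all.equal(floored, pmax(toy_age, 18))

# Add 10 years only where both conditions hold (female and Asian).
adjusted <- ifelse(toy_gender == "female" & toy_race == "asian",
                   floored + 10, floored)
adjusted
```

Only the third toy voter is both female and Asian, so only that age is shifted.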
Now we create a data frame from all generated values. Using the kableExtra package we can review the first 10 rows to get a view of the dataset.
df <- data.frame(df_age, voter_gender, voter_race)
df$voter_race <- as.factor(df$voter_race)
df$voter_gender <- as.factor(df$voter_gender)
library(kableExtra) # Provides kbl(), kable_material_dark(), column_spec() and spec_color()
kbl(head(df, 10), caption = "First 10 rows from generated random sample", col.names = c("Age", "Gender", "Race")) %>%
kable_material_dark() %>%
column_spec(1, background = spec_color(df$df_age[1:10], end = 0.8))
| Age | Gender | Race |
|---|---|---|
| 70.66128 | female | hispanic |
| 65.19432 | female | white |
| 47.03359 | female | other |
| 57.69309 | female | white |
| 61.26358 | male | white |
| 63.09253 | male | white |
| 46.73995 | female | white |
| 61.10563 | male | hispanic |
| 43.39356 | female | black |
| 23.65638 | female | hispanic |
Data & Plots
Now we want to take a look at how the data turned out. To do that, we will review the summary and create some plots to understand the data better.
kbl(summary(df))
| df_age | voter_gender | voter_race |
|---|---|---|
| Min. : 18.00 | female:53 | asian :10 |
| 1st Qu.: 43.47 | male :47 | black :12 |
| Median : 54.48 | NA | hispanic:20 |
| Mean : 55.06 | NA | other : 9 |
| 3rd Qu.: 66.55 | NA | white :49 |
| Max. :106.13 | NA | NA |
library(ggplot2) # Provides ggplot() and the geoms used below
ggplot(aes(x = df_age, y = voter_gender, fill = voter_gender), data = df) + geom_violin() + geom_jitter(color = 'black') + theme_minimal() + theme(legend.position = 'none') + xlab("Voter age") + ylab("") + scale_fill_brewer(palette="Greens") + ggtitle("Age distribution of voters grouped by gender")
ggplot(aes(x = df_age, y = voter_race, fill = voter_race), data = df) + geom_boxplot() + geom_jitter(color = 'black') + theme_minimal() + theme(legend.position = 'none') + xlab("Voter age") + ylab("") + scale_fill_brewer(palette="Greens") + ggtitle("Age distribution of voters grouped by race")
ggplot(aes(x = df_age, y = voter_race, fill = voter_gender), data = df) + geom_boxplot() + theme_minimal() + ggtitle("Age distribution of voters") + xlab("Voter age") + ylab("Voter race") + scale_fill_brewer(name = "Gender", palette="Greens")
Analysis of the plots:
- The age distributions of the two genders are quite similar. However, following the adjustment we made at the beginning, male voters appear to start at slightly younger ages than female voters, while the female distribution extends further into later years.
- The age distributions by race appear to differ quite visibly. However, we must take into consideration that the sample is relatively small (only 100 observations). From the summary we can see that, for example, only 12 black and 10 Asian voters are in the dataset. Hence, it is not a good idea to jump to conclusions from such a small sample yet.
ANOVA
Two-way ANOVA test hypotheses
1. There is no difference in the means of factor A
2. There is no difference in the means of factor B
3. There is no interaction between factors A and B
The alternative hypothesis for cases 1 and 2 is: the means are not equal.
The alternative hypothesis for case 3 is: there is an interaction between A and B.
TASK : provide introduction about data, statistics, plots; perform anova + decision and post-hoc tests
In this example we may write those null hypotheses in the following forms:
H0A : Factor A = gender of voters has no effect on the age at which people vote.
H0B : Factor B = race of voters has no effect on the age at which people vote.
H0AxB : There is no interaction between gender and race of voters that has an effect on the age at which they vote.
Hence, we will calculate the test statistic F three times and decide which of the null hypotheses can be rejected.
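The three F statistics can be written compactly. The notation below is the standard fixed-effects two-way layout and is an assumption about the model rather than something stated in the report: i indexes the a = 2 gender levels, j the b = 5 race levels, k the observations within a cell, N = 100 is the total sample size, and MS denotes a mean square.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}

F_A = \frac{MS_A}{MS_E}, \qquad
F_B = \frac{MS_B}{MS_E}, \qquad
F_{A \times B} = \frac{MS_{A \times B}}{MS_E}

df_A = a - 1 = 1, \quad
df_B = b - 1 = 4, \quad
df_{A \times B} = (a - 1)(b - 1) = 4, \quad
df_E = N - ab = 90
```

Each null hypothesis corresponds to the matching effect terms all being zero, and each F ratio is compared with an F distribution on the effect and error degrees of freedom.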
Assumptions
To calculate ANOVA with interactions we must first state some assumptions. An ANOVA test can be performed if:
- samples are independent (when talking about independent ANOVA)
- samples come from populations that are normally distributed - normality check
- homogeneity of variances in groups - check of homoscedasticity
Normality
First, let's look at the normality assumption. We will use the Shapiro-Wilk test. The null hypothesis of the test is H0 : the sample data come from a normally distributed population.
library(rstatix) # Provides mshapiro_test() and shapiro_test()
kbl(mshapiro_test(df$df_age))
| statistic | p.value |
|---|---|
| 0.9902157 | 0.6820391 |
At an alpha level of 0.05, we cannot reject the null hypothesis. Hence, we have no evidence against normality and treat the data as coming from a normal population. That allows us to continue with the ANOVA analysis.
We can also run the test separately within each group of a factor, using the following method:
library(dplyr) # Provides group_by()
kbl(df %>%
group_by(voter_gender) %>%
shapiro_test(df_age))
| voter_gender | variable | statistic | p |
|---|---|---|---|
| female | df_age | 0.9781397 | 0.4373404 |
| male | df_age | 0.9720733 | 0.3170315 |
kbl(df %>%
group_by(voter_race) %>%
shapiro_test(df_age))
| voter_race | variable | statistic | p |
|---|---|---|---|
| asian | df_age | 0.9722693 | 0.9110511 |
| black | df_age | 0.8756657 | 0.0771469 |
| hispanic | df_age | 0.9569201 | 0.4843039 |
| other | df_age | 0.9072652 | 0.2972001 |
| white | df_age | 0.9668000 | 0.1800314 |
These tables show for which groups the normality assumption is met and for which (if any) it is not. Every p-value is bigger than the 0.05 that we have chosen as our significance level; the smallest, 0.077 for black voters, is the only one close to it.
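The normality assumption of ANOVA concerns the model residuals, so a common complement to the group-wise checks is a Shapiro-Wilk test on the residuals of the fitted model. The sketch below uses R's built-in warpbreaks data as a stand-in (an assumption, so that the example runs on its own); the same two calls apply to any two-way aov() fit.

```r
# Built-in example data stands in for the report's data frame (assumption).
fit <- aov(breaks ~ wool * tension, data = warpbreaks)

# Shapiro-Wilk test on the residuals rather than on the raw response.
res_test <- shapiro.test(residuals(fit))
res_test$statistic
res_test$p.value
```

A p-value above the chosen alpha gives no evidence against normally distributed residuals.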
Homogeneity of variances
To check the homogeneity of variances we will use two tests: Bartlett’s Test and Levene’s Test. Levene’s test is more robust because it is less sensitive to departures from normality and, in its median-centred form, to outliers. However, just for illustration we will use both of them.
bartlett.test(df_age ~ interaction(voter_race, voter_gender), data = df)
##
## Bartlett test of homogeneity of variances
##
## data: df_age by interaction(voter_race, voter_gender)
## Bartlett's K-squared = 14.524, df = 9, p-value = 0.1049
Bartlett’s Test with the two independent variables voter_race and voter_gender returns a p-value of 0.105. This is bigger than our alpha, so we can assume that the homogeneity of variances is fulfilled.
library(car) # Provides leveneTest()
leveneTest(df_age ~ voter_race * voter_gender, data = df)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 9 1.3847 0.2069
## 90
Levene’s Test leads us to the same conclusion (p-value 0.207). The assumption of homogeneity of variances is therefore fulfilled.
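To see what the median-centred Levene test measures, it can be reproduced in base R as a one-way ANOVA on the absolute deviations of each observation from its cell median. This is a sketch on the built-in warpbreaks data (an assumption, used so that the example runs without extra packages), not the report's data.

```r
d <- warpbreaks
d$group <- interaction(d$wool, d$tension)       # one cell per wool/tension combination

# Absolute deviation of each observation from the median of its own cell.
cell_median <- ave(d$breaks, d$group, FUN = median)
d$abs_dev   <- abs(d$breaks - cell_median)

# One-way ANOVA on the deviations: the F test here is the Levene statistic.
levene_by_hand <- anova(lm(abs_dev ~ group, data = d))
levene_by_hand
```

A large F value would mean that the typical spread differs between cells, which is what the homogeneity assumption rules out.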
ANOVA Calculations
Now, with all assumptions fulfilled, we can calculate ANOVA.
Since we want to calculate ANOVA with interactions, we use the following formula to build the model:
model <- aov(df_age ~ voter_race * voter_gender, data = df)
Now we can pass the model to the anova() function.
As we know, the calculation of ANOVA uses the following components:
- Sum of Squares for Treatment A (gender), Treatment B (race), Interaction AxB, Error (within)
- Mean Sum of Squares for all above
- Degrees of Freedom
- F value
All these parts can be computed manually. However, thanks to the R function anova() we can also easily see the p-value and quickly assess whether each null hypothesis can or cannot be rejected.
kbl(anova(model))
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| voter_race | 4 | 2130.84480 | 532.71120 | 1.5460248 | 0.1956939 |
| voter_gender | 1 | 18.35712 | 18.35712 | 0.0532757 | 0.8179822 |
| voter_race:voter_gender | 4 | 896.59175 | 224.14794 | 0.6505181 | 0.6279849 |
| Residuals | 90 | 31011.15116 | 344.56835 | NA | NA |
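As a check on the table above, the mean squares, F values and p-values can be recomputed by hand from the sums of squares and degrees of freedom. The numbers are copied from the table; the variable names are illustrative.

```r
# Values copied from the ANOVA table.
sum_sq    <- c(voter_race = 2130.84480, voter_gender = 18.35712,
               interaction = 896.59175)
df_effect <- c(voter_race = 4, voter_gender = 1, interaction = 4)
ss_resid  <- 31011.15116
df_resid  <- 90

mean_sq  <- sum_sq / df_effect      # mean square = sum of squares / df
ms_resid <- ss_resid / df_resid     # residual (error) mean square
f_value  <- mean_sq / ms_resid      # F = effect mean square / residual mean square

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_effect, df_resid, lower.tail = FALSE)

round(cbind(mean_sq, f_value, p_value), 4)
```

The recomputed columns agree with the Mean Sq, F value and Pr(>F) columns of the table up to rounding.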
The ANOVA results state that:
- there is no significant effect of voter race on voter age. The p-value of 0.196 is bigger than our significance level of 0.05, hence we cannot reject H0B.
- there is also no significant effect of voter gender on voter age. The p-value is 0.818, which is much higher than alpha, so we cannot reject H0A.
- there is also no significant interaction between gender and race (p-value 0.628). We cannot reject H0AxB.
From that, we can see that we did not manage to reject any null hypothesis. Hence, this sample gives no evidence that the race or gender of voters influences the age at which they vote.
We can generate an interaction plot to review the connections visually.
interaction.plot(x.factor = df$voter_race, trace.factor = df$voter_gender,
response = df$df_age, fun = mean,
type = "b", legend = TRUE,
xlab = "Race of voters", ylab = "Age of voters",
pch = c(1,19),
trace.label = "",
col = c("#D83FFF", "#12FEF7"))
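The points drawn by interaction.plot() are simply the cell means of the response for every combination of the two factors. A minimal sketch with tapply() on the built-in warpbreaks data (an assumption, used so that the example is self-contained):

```r
# Mean of the response for each combination of the two factors.
cell_means <- tapply(warpbreaks$breaks,
                     list(tension = warpbreaks$tension,
                          wool    = warpbreaks$wool),
                     mean)
round(cell_means, 1)
```

Each column of this matrix corresponds to one trace in the plot; roughly parallel traces suggest little interaction, while crossing traces suggest the effect of one factor depends on the other.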
Another way to visualize the ANOVA result is by using Plot2WayANOVA. Here we get the results of the ANOVA calculations as well as an interaction plot. It is similar to the previously generated one (the scale is different), but it also shows 95% confidence intervals. A check of homogeneity of variance as well as of normality is also performed here. It would also perform post-hoc tests if any effect were significant.
library(CGPfunctions) # Provides Plot2WayANOVA()
Plot2WayANOVA(formula = df_age ~ voter_race * voter_gender, dataframe = df)
##
## --- WARNING! ---
## You have an unbalanced design. Using Type II sum of
## squares, to calculate factor effect sizes eta and omega.
## Your two factors account for 0.09 of the type II sum of
## squares.
## term | sumsq | meansq | df | statistic | p.value | etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f | power
## -----------------------------------------------------------------------------------------------------------------------------------------------------------
## voter_race | 2139.555 | 534.889 | 4 | 1.552 | 0.194 | 0.063 | 0.065 | 0.022 | 0.022 | 0.022 | 0.263 | 0.486
## voter_gender | 18.357 | 18.357 | 1 | 0.053 | 0.818 | 0.001 | 0.001 | -0.009 | -0.010 | -0.010 | 0.024 | 0.056
## voter_race:voter_gender | 896.592 | 224.148 | 4 | 0.651 | 0.628 | 0.026 | 0.028 | -0.014 | -0.014 | -0.014 | 0.170 | 0.215
## Residuals | 31011.151 | 344.568 | 90 | | | | | | | | |
##
## Table of group means
## # A tibble: 10 x 15
## # Groups: voter_race [5]
## voter_race voter_gender TheMean TheSD TheSEM CIMuliplier LowerBoundCI
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 asian female 60.8 10.8 4.81 2.78 47.4
## 2 asian male 58.7 22.2 9.91 2.78 31.2
## 3 black female 54.1 24.2 10.8 2.78 24.1
## 4 black male 46.9 32.7 12.3 2.45 16.6
## 5 hispanic female 49.1 18.5 5.13 2.18 37.9
## 6 hispanic male 57.2 18.0 6.80 2.45 40.5
## 7 other female 37.7 9.01 4.51 3.18 23.4
## 8 other male 51.5 26.8 12.0 2.78 18.2
## 9 white female 59.2 15.3 3.00 2.06 53.0
## 10 white male 57.6 14.8 3.09 2.07 51.2
## # ... with 8 more variables: UpperBoundCI <dbl>, LowerBoundSEM <dbl>,
## # UpperBoundSEM <dbl>, LowerBoundSD <dbl>, UpperBoundSD <dbl>, N <int>,
## # LowerBound <dbl>, UpperBound <dbl>
##
## Post hoc tests for all effects that were significant
## [1] "No signfiicant effects"
##
## Testing Homogeneity of Variance with Brown-Forsythe
## Brown-Forsythe Test for Homogeneity of Variance using median
## Df F value Pr(>F)
## group 9 1.3847 0.2069
## 90
##
## Testing Normality Assumption with Shapiro-Wilk
##
## Shapiro-Wilk normality test
##
## data: MyAOV_residuals
## W = 0.9809, p-value = 0.156
##
## Bayesian analysis of models in order
## # A tibble: 4 x 4
## model bf support margin_of_error
## <chr> <dbl> <chr> <dbl>
## 1 voter_race 0.389 " data support is an~ 0.000000724
## 2 voter_gender 0.214 " data support is mo~ 0.000253
## 3 voter_race + voter_gender 0.0836 " data support is st~ 0.0253
## 4 voter_race + voter_gender + vote~ 0.0163 " data support is ve~ 0.0141
##
## Interaction graph plotted...
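The effect-size columns in the output above can be related back to the sums of squares. The sketch below assumes the usual definitions (partial eta squared as effect sum of squares over effect plus residual sum of squares, and Cohen's f derived from it); the input values are copied from the Type II table.

```r
# Type II sums of squares copied from the Plot2WayANOVA table.
ss_type2 <- c(voter_race = 2139.555, voter_gender = 18.357,
              interaction = 896.592)
ss_resid <- 31011.151

# Partial eta squared: share of (effect + residual) variation due to the effect.
partial_eta_sq <- ss_type2 / (ss_type2 + ss_resid)

# Cohen's f derived from partial eta squared.
cohens_f <- sqrt(partial_eta_sq / (1 - partial_eta_sq))

round(cbind(partial_eta_sq, cohens_f), 3)
```

Under these definitions the values match the partial.etasq and cohens.f columns to three decimals, and all three effects are small.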
Post-hoc tests
Since we did not reject any H0, there is no need to perform a post-hoc test.
However, in this case we will still look at Tukey's HSD (Honestly Significant Difference) test, as it should only confirm that there are no significant pairwise differences between the groups.
post_hoc1 <- TukeyHSD(model)
post_hoc1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = df_age ~ voter_race * voter_gender, data = df)
##
## $voter_race
## diff lwr upr p adj
## black-asian -9.867689 -31.993785 12.258407 0.7271518
## hispanic-asian -7.825068 -27.838875 12.188739 0.8120482
## other-asian -14.363255 -38.106475 9.379966 0.4488988
## white-asian -1.338277 -19.269588 16.593034 0.9995774
## hispanic-black 2.042620 -16.826578 20.911819 0.9981723
## other-black -4.495566 -27.282287 18.291156 0.9817727
## white-black 8.529412 -8.114689 25.173512 0.6122884
## other-hispanic -6.538186 -27.280004 14.203632 0.9044483
## white-hispanic 6.486791 -7.225044 20.198626 0.6814270
## white-other 13.024978 -5.715419 31.765374 0.3067237
##
## $voter_gender
## diff lwr upr p adj
## male-female 0.8494951 -6.539367 8.238357 0.8198478
##
## $`voter_race:voter_gender`
## diff lwr upr p adj
## black:female-asian:female -6.6536030 -44.74314 31.43594 0.9999043
## hispanic:female-asian:female -11.6769627 -43.36937 20.01545 0.9712784
## other:female-asian:female -23.0697813 -63.46984 17.33027 0.7000385
## white:female-asian:female -1.6202349 -31.02955 27.78908 1.0000000
## asian:male-asian:female -2.0754824 -40.16502 36.01406 1.0000000
## black:male-asian:female -13.9424493 -49.20651 21.32161 0.9550324
## hispanic:male-asian:female -3.6365251 -38.90059 31.62753 0.9999990
## other:male-asian:female -9.2659675 -47.35551 28.82357 0.9985831
## white:male-asian:female -3.2303823 -32.94744 26.48668 0.9999984
## hispanic:female-black:female -5.0233597 -36.71577 26.66905 0.9999578
## other:female-black:female -16.4161783 -56.81623 23.98388 0.9468269
## white:female-black:female 5.0333681 -24.37595 34.44268 0.9999194
## asian:male-black:female 4.5781206 -33.51142 42.66766 0.9999961
## black:male-black:female -7.2888462 -42.55291 27.97521 0.9996150
## hispanic:male-black:female 3.0170780 -32.24698 38.28114 0.9999998
## other:male-black:female -2.6123645 -40.70190 35.47717 1.0000000
## white:male-black:female 3.4232207 -26.29384 33.14028 0.9999973
## other:female-hispanic:female -11.3928186 -45.82769 23.04206 0.9860885
## white:female-hispanic:female 10.0567278 -10.40064 30.51409 0.8471477
## asian:male-hispanic:female 9.6014803 -22.09093 41.29389 0.9925563
## black:male-hispanic:female -2.2654865 -30.49933 25.96836 0.9999999
## hispanic:male-hispanic:female 8.0404377 -20.19341 36.27428 0.9952746
## other:male-hispanic:female 2.4109952 -29.28142 34.10341 0.9999999
## white:male-hispanic:female 8.4465804 -12.45078 29.34394 0.9485098
## white:female-other:female 21.4495464 -10.89639 53.79548 0.4972123
## asian:male-other:female 20.9942989 -19.40576 61.39435 0.8003212
## black:male-other:female 9.1273320 -28.62059 46.87525 0.9986489
## hispanic:male-other:female 19.4332562 -18.31466 57.18118 0.8088557
## other:male-other:female 13.8038138 -26.59624 54.20387 0.9826357
## white:male-other:female 19.8393990 -12.78659 52.46539 0.6198494
## asian:male-white:female -0.4552475 -29.86456 28.95407 1.0000000
## black:male-white:female -12.3222143 -37.96688 13.32245 0.8638243
## hispanic:male-white:female -2.0162901 -27.66095 23.62837 0.9999999
## other:male-white:female -7.6457326 -37.05505 21.76358 0.9976276
## white:male-white:female -1.6101474 -18.84959 15.62929 0.9999996
## black:male-asian:male -11.8669668 -47.13103 23.39709 0.9843555
## hispanic:male-asian:male -1.5610426 -36.82510 33.70302 1.0000000
## other:male-asian:male -7.1904851 -45.28002 30.89905 0.9998172
## white:male-asian:male -1.1548998 -30.87196 28.56216 1.0000000
## hispanic:male-black:male 10.3059242 -21.88561 42.49746 0.9889518
## other:male-black:male 4.6764817 -30.58758 39.94054 0.9999909
## white:male-black:male 10.7120670 -15.28494 36.70908 0.9421647
## other:male-hispanic:male -5.6294425 -40.89350 29.63462 0.9999552
## white:male-hispanic:male 0.4061428 -25.59087 26.40315 1.0000000
## white:male-other:male 6.0355853 -23.68147 35.75264 0.9996663
None of these adjusted p-values is smaller than 0.05. Hence, the test finds no significant pairwise difference in age between races, genders, or their combinations.
par(mfrow=c(2,1))
par(mar=c(4.8,3.5,2.8,1.5))
plot(post_hoc1)
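With 45 interaction comparisons, reading the Tukey output by eye is error-prone. A minimal sketch of filtering the result programmatically, shown on the built-in warpbreaks data (an assumption, so that the example runs on its own); the same lapply() call works on any TukeyHSD object.

```r
fit   <- aov(breaks ~ wool * tension, data = warpbreaks)
tukey <- TukeyHSD(fit)

# Keep only the comparisons whose adjusted p-value is below alpha, per term.
alpha <- 0.05
significant <- lapply(tukey, function(m) m[m[, "p adj"] < alpha, , drop = FALSE])
sapply(significant, nrow)   # number of significant pairs for each term

# The number of pairwise comparisons among k groups is choose(k, 2).
choose(5, 2)    # 10 pairs of races
choose(10, 2)   # 45 pairs of race-by-gender cells
```

For a result with no significant pairs, every element of the filtered list has zero rows.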