We decided to perform two ANOVA tests. They share a common variable: in both tests, we use years of education as the continuous variable.
We believe that Russia’s information environment plays an important role in shaping public opinion on various events. The state curriculum presents government actions in a favorable light, so people who study longer may view the government more favorably than others. Therefore, we investigate the relationship between the number of years of education and the degree of satisfaction with the state's actions.
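library(foreign); library(dplyr); library(ggplot2)  # read.spss(); %>% and select(); plots
ESS8RU3 <- read.spss("ESS8RU.sav", use.value.labels = TRUE, to.data.frame = TRUE)
# Keep the two variables of interest and drop rows with missing values
rus_anova <- ESS8RU3 %>% dplyr::select(eduyrs, stfgov) %>% na.omit()
# Note: if eduyrs was read in as a factor, as.numeric() returns level codes, not the recorded years
rus_anova$eduyrs <- as.numeric(rus_anova$eduyrs)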
ggplot(rus_anova, aes(x = eduyrs, y = ..density..)) +
geom_histogram(binwidth = 1, alpha = 0.3, fill = c("white"), colour = "black")+
geom_density(alpha = 0.7, fill = c("lightblue"))+
scale_x_continuous(breaks = 0:25*1)+
labs(title = "Distribution of years of formal education", x = "Years of education", y = "Density")+
theme_bw()
The distribution is close to bimodal, which can be explained by the Russian educational system: college and university students spend a different number of years studying, which results in two peaks in the graph.
Then, we look at the distribution of respondents by groups:
# Recode the 0-10 satisfaction scale into three categories
rus_anova$level <- rep(NA, length(rus_anova$stfgov))
rus_anova$level[rus_anova$stfgov %in% 0:4] <- "Dissatisfied"
rus_anova$level[rus_anova$stfgov %in% 5:6] <- "Neutral"
rus_anova$level[rus_anova$stfgov %in% 7:10] <- "Satisfied"
rus_anova$level <- as.factor(rus_anova$level)
ggplot(data = na.omit(subset(rus_anova, select = level)), aes(x = level, y = ..count../sum(..count..)))+
geom_bar(alpha = 0.5, fill = wes_palette("Rushmore1", 3), color = "black") +
labs(title = "Distribution of people according to their\nsatisfaction with the government ", x = "Levels of satisfaction of the national government", y = "Percentages") +
geom_text(aes(label = percent(..count../sum(..count..))), size = 7, stat= "count", position = position_stack(vjust = 0.5)) +
scale_y_continuous(labels = percent) +
theme_bw()
ANOVA is more robust when the groups are of comparable size. As the bar chart shows, our three groups (those who are dissatisfied with the national government, those who are neutral, and those who are satisfied) are more or less comparable in size.
ggplot(data = na.omit(subset(rus_anova, select = c(eduyrs, level)))) +
geom_boxplot(aes(x = level, y = eduyrs), fill = wes_palette("Rushmore1", 3), alpha = 0.5) +
labs(title = "Distribution of years of formal education\naccording to the level of satisfaction with the government", x = "Satisfaction with the government", y = "Years of education") +
theme_bw()
From this boxplot, we see that years of education are distributed roughly symmetrically within each of the groups with different levels of satisfaction with the government.
After checking our variables, we are ready for the ANOVA test. It rests on three assumptions:
1. Independence of observations. We treat this assumption as met, since it depends on the sampling methodology of the survey.
2. Equality of variances.
3. Normal distribution of residuals (this assumption is checked after running the ANOVA).
We start by checking the equality of variances:
leveneTest(rus_anova$eduyrs ~ rus_anova$level)
Since the p-value is small, we conclude that the variances are not equal. Yet, according to a rule of thumb, ANOVA is robust to heterogeneity of variance if the largest group variance is no more than 4 times the smallest. Our data meet this criterion, so we treat this assumption as acceptably met.
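The rule of thumb above can be checked directly by comparing group variances. The following is a minimal sketch in base R on simulated data: the data frame `demo` and its values are hypothetical stand-ins for `rus_anova`, not the survey data.

```r
# Simulated stand-in for rus_anova (hypothetical values, not ESS data)
set.seed(1)
demo <- data.frame(
  eduyrs = c(rnorm(200, mean = 13, sd = 2.5),
             rnorm(200, mean = 13, sd = 3.0),
             rnorm(200, mean = 13, sd = 3.5)),
  level  = factor(rep(c("Dissatisfied", "Neutral", "Satisfied"), each = 200))
)

# Variance of years of education within each satisfaction group
group_var <- tapply(demo$eduyrs, demo$level, var)

# Ratio of the largest to the smallest group variance
var_ratio <- max(group_var) / min(group_var)
print(group_var)
print(var_ratio)

# Rule of thumb: a ratio of at most 4 is treated as acceptable
print(var_ratio <= 4)
```

On the real data, the same two lines (`tapply()` and the ratio) would be applied to `rus_anova$eduyrs` and `rus_anova$level`.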
Hypotheses:
H0: True means across the groups are equal
H1: At least one group's true mean differs from the others
oneway.test(rus_anova$eduyrs ~ rus_anova$level)
##
## One-way analysis of means (not assuming equal variances)
##
## data: rus_anova$eduyrs and rus_anova$level
## F = 2.9195, num df = 2.0, denom df = 1243.5, p-value = 0.05433
Since the p-value (0.054) is slightly above 0.05, we cannot reject the null hypothesis: we find no statistically significant difference between the group means.
Because not all assumptions of the original ANOVA are perfectly met, we also apply its non-parametric alternative, the Kruskal-Wallis test.
H0: mean ranks of the groups are the same
H1: mean ranks of the groups are not the same
kruskal.test(eduyrs ~ level, data = rus_anova)
##
## Kruskal-Wallis rank sum test
##
## data: eduyrs by level
## Kruskal-Wallis chi-squared = 4.838, df = 2, p-value = 0.08901
Since the p-value > 0.05, we cannot reject the hypothesis that the mean ranks of the groups are the same. This is consistent with the result of the previous ANOVA test.
All in all, we find no statistically significant relationship between education, measured as the number of years spent studying, and whether a person is satisfied with the government.
Knowledgeable and properly educated workers are increasingly in demand in various spheres of production and services. We decided to take a closer look at the types of organizations people work for and the total number of years of their formal education, and to conduct an analysis of variance (ANOVA) to check whether the true mean years of education is equal across workers in different types of organizations.
rus4$tporgwk <- as.factor(rus4$tporgwk)
# type of organization a person works in [categorical variable]
rus4$eduyrs <- as.numeric(rus4$eduyrs)
# the total number of years a person was formally educated [continuous variable]
Formally stated hypotheses:
H0: The true mean for each group is equal.
H1: The true mean of at least one group differs from the others.
Now, let's take a look at the data:
rus_stat <- rus4 %>%
  group_by(tporgwk) %>%
  summarise(count = n(), mean = mean(eduyrs), median = median(eduyrs),
            max = max(eduyrs), min = min(eduyrs)) %>%
  arrange(desc(median))
formattable(rus_stat,
align =c("l","c", "c", "c", "c", "c"),
list(tporgwk = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))))
tporgwk | count | mean | median | max | min |
---|---|---|---|---|---|
Central or local government | 93 | 14.33333 | 15 | 19 | 4 |
Other public sector (such as education and health) | 489 | 13.61145 | 14 | 22 | 1 |
Self employed | 124 | 13.41935 | 14 | 19 | 7 |
A private firm | 1064 | 13.27162 | 13 | 21 | 5 |
A state owned enterprise | 310 | 12.17097 | 12 | 20 | 2 |
From the table, it can be seen that in each type of organization there is at least one person who has not finished school and at least one person with a master's degree or higher.
ggplot(rus4)+
geom_boxplot(aes(x = tporgwk, y = eduyrs),alpha = 0.5, color = "black", fill = wes_palette("Rushmore1", 5))+
labs(title = "Years spent on formal education by people\nworking in different types of organizations", y = "Number of years", x = "Type of organization")+
scale_y_continuous(breaks = 0:23*5)+
theme_bw()+
theme(axis.text.x = element_text(angle = 20, hjust = 1))
The boxplot shows many outliers in almost every category, which indicates that people with drastically different levels of education can work in almost any sphere of employment.
Now, let's have a look at the percentage of people working in each type of organization:
ggplot(rus4, aes(x = as.factor(tporgwk), y =..count../sum(..count..)))+
geom_bar(color = "black", fill = wes_palette("Rushmore1", 5), alpha = 0.5)+
geom_text(aes(label = percent(..count../sum(..count..))), size = 5, stat= "count", position = position_stack(vjust = 0.5)) +
scale_y_continuous(labels = percent) +
labs(title = "Percentage of people employed in different\ntypes of organizations", x = "Type of organization", y = "Percentage, %")+
theme_bw()+
theme(axis.text.x = element_text(angle = 20, hjust = 1))
The bar chart reveals that slightly more than half of the sample is employed in private firms and almost a quarter in the other public sector, while the remaining quarter is divided among state-owned enterprises (about 15%), the self-employed (about 6%), and central or local government (<5%).
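The shares read off the bar chart can be recomputed from the group counts reported in the summary table. A minimal sketch in base R; the object names `counts` and `pct` are introduced here for illustration only, and one group label is shortened.

```r
# Group sizes as reported in the summary table
counts <- c(
  "Central or local government" = 93,
  "Other public sector"         = 489,
  "Self employed"               = 124,
  "A private firm"              = 1064,
  "A state owned enterprise"    = 310
)

# Share of each group in the sample, in percent
pct <- round(100 * prop.table(counts), 1)
print(sort(pct, decreasing = TRUE))
```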
oneway.test(rus4$eduyrs ~ rus4$tporgwk, var.equal = F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: rus4$eduyrs and rus4$tporgwk
## F = 15.297, num df = 4.00, denom df = 379.46, p-value = 1.315e-11
The first ANOVA, Welch's test (which does not assume equal variances), reports F(4, 379.46) = 15.297, p-value = 1.315e-11 << 0.05, which means that the difference in mean years of education is statistically significant.
aov_output <- aov(rus4$eduyrs ~ rus4$tporgwk)
summary(aov_output)
## Df Sum Sq Mean Sq F value Pr(>F)
## rus4$tporgwk 4 538 134.5 19.21 1.56e-15 ***
## Residuals 2075 14523 7.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The second ANOVA, the classic test fitted with aov(), confirms the first result: F(4, 2075) = 19.21, p-value = 1.56e-15 << 0.05, meaning that the group means are indeed unequal.
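The two F values differ because `oneway.test()` applies Welch's correction by default, while `aov()` assumes equal variances. The sketch below illustrates this on simulated data (the data frame `demo` and its groups are hypothetical): with `var.equal = TRUE`, `oneway.test()` reproduces the F statistic of `aov()`.

```r
# Simulated data with unequal group sizes and variances (hypothetical values)
set.seed(42)
demo <- data.frame(
  y = c(rnorm(50, 12, 2), rnorm(80, 13, 3), rnorm(120, 14, 4)),
  g = factor(rep(c("a", "b", "c"), times = c(50, 80, 120)))
)

welch   <- oneway.test(y ~ g, data = demo, var.equal = FALSE)  # Welch's ANOVA
classic <- oneway.test(y ~ g, data = demo, var.equal = TRUE)   # classic one-way ANOVA
aov_f   <- summary(aov(y ~ g, data = demo))[[1]][["F value"]][1]

# The classic F statistic matches the one from aov()
print(all.equal(unname(classic$statistic), aov_f))

# Welch's test adjusts the denominator degrees of freedom, so they are usually fractional
print(welch$parameter)
print(classic$parameter)
```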
Normality of residuals assumption:
3.1 Via plotting
plot(aov_output, 2)
The Q-Q plot suggests that the residuals are roughly normal, as the majority of observations follow the straight diagonal line.
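# `adjust` is not an argument of pairwise.t.test(): it is silently ignored, so the default
# Holm correction is applied (as the output reports). The method is set via p.adjust.method.
pairwise.t.test(rus4$eduyrs, rus4$tporgwk, p.adjust.method = "holm")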
##
## Pairwise comparisons using t tests with pooled SD
##
## data: rus4$eduyrs and rus4$tporgwk
##
## A state owned enterprise
## A private firm 1.1e-09
## Self employed 6.6e-05
## Other public sector (such as education and health) 9.4e-13
## Central or local government 5.7e-11
## A private firm
## A private firm -
## Self employed 0.9406
## Other public sector (such as education and health) 0.0638
## Central or local government 0.0013
## Self employed
## A private firm -
## Self employed -
## Other public sector (such as education and health) 0.9406
## Central or local government 0.0593
## Other public sector (such as education and health)
## A private firm -
## Self employed -
## Other public sector (such as education and health) -
## Central or local government 0.0638
##
## P value adjustment method: holm
From the table, it can be concluded that the pairwise comparisons “Private firm vs. Self-employed” and “Self-employed vs. Other public sector” are clearly not significant (adjusted p-value = 0.94 for both), whereas the remaining pairs are. Note that we also treated three comparisons with adjusted p-values slightly above 0.05 (0.059 to 0.064) as significant. Simply speaking, the mean years of education within the two non-significant pairs are nearly equal, which is why they do not produce a p-value smaller than 0.05.
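The adjusted p-values above depend on the correction method. As a minimal sketch of how the Holm and Bonferroni corrections compare, the example below applies base R's `p.adjust()` to ten hypothetical unadjusted p-values (five groups give ten pairwise comparisons); the values in `raw_p` are invented for illustration and are not taken from our data.

```r
# Hypothetical unadjusted p-values for ten pairwise comparisons (illustrative only)
raw_p <- c(0.0001, 0.0004, 0.002, 0.008, 0.012, 0.02, 0.03, 0.04, 0.2, 0.6)

holm_p <- p.adjust(raw_p, method = "holm")
bonf_p <- p.adjust(raw_p, method = "bonferroni")

print(data.frame(raw = raw_p, holm = holm_p, bonferroni = bonf_p))

# Holm-adjusted p-values never exceed the Bonferroni-adjusted ones
print(all(holm_p <= bonf_p))
```

Holm's method controls the same family-wise error rate while being less conservative, which is presumably why it is the default in `pairwise.t.test()`.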
Since in our case not all the ANOVA assumptions were perfectly met, it is meaningful to run a non-parametric test to validate the results:
kruskal.test(eduyrs ~ tporgwk, data = rus4)
##
## Kruskal-Wallis rank sum test
##
## data: eduyrs by tporgwk
## Kruskal-Wallis chi-squared = 75.249, df = 4, p-value = 1.765e-15
Kruskal-Wallis chi-squared(4) = 75.249, p-value = 1.765e-15 << 0.05, which means that the distribution of years of education, in terms of mean ranks, differs across groups.
Non-parametric post hoc test
Since the non-parametric test produces significant results, we can run a non-parametric post hoc test (Dunn's test) to be more confident in the results:
dunn.test(rus4$eduyrs, rus4$tporgwk, kw = F)
##
## Comparison of x by group
## (No adjustment)
## Col Mean-|
## Row Mean | A privat A state Central Other pu
## ---------+--------------------------------------------
## A state | 5.491063
## | 0.0000*
## |
## Central | -4.754063 -7.345557
## | 0.0000* 0.0000*
## |
## Other pu | -2.797061 -6.986442 3.193343
## | 0.0026* 0.0000* 0.0007*
## |
## Self emp | -0.684781 -3.946918 3.273804 0.873569
## | 0.2467 0.0000* 0.0005* 0.1912
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
As in the pairwise t-tests, the differences for two pairs (“Private firm vs. Self-employed” and “Self-employed vs. Other public sector”) are again statistically insignificant. All other comparisons, marked with an asterisk in the output, are statistically significant.
The effect size
omega_sq <- function(aov_output){
sum_stats <- summary(aov_output)[[1]]
SSm <- sum_stats[["Sum Sq"]][1]
SSr <- sum_stats[["Sum Sq"]][2]
DFm <- sum_stats[["Df"]][1]
MSr <- sum_stats[["Mean Sq"]][2]
W2 <- (SSm-DFm*MSr)/(SSm+SSr+MSr)
return(W2)
}
omega_sq(aov_output)
## [1] 0.03384191
The effect size of about 0.03 is small: substituting the rounded values from the ANOVA table gives (538 - 4 × 7.0) / (538 + 14523 + 7.0) ≈ 0.034. So even though the difference between groups is statistically significant, in practice the type of organization accounts for only about 3% of the variance in years of education.
Since the mean is below 15 years in every group, a considerable share of people in almost every type of organization likely have no university diploma. Still, central and local government staff are the most educated group (median of 15 years), so people working in state structures are relatively likely to hold at least a bachelor's degree, which is reassuring.