Anna Shirokanova and Olesya Volchenko
March 4, 2019
See more: https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php
Textbook: Field, 2016, chapter 16.
Context to the analysis:
There are theories set to explain on the macro level how societies change when they modernise: get industry, then, information industry, people get rich, live longer lives, etc. What happens to people’s values when life becomes secure and survival guaranteed? Theories that aim to explain these processes are called ‘modernization theories’.
One of the offsprings of modernization theory by R.Inglehart is the recent ‘theory of emancipation’ by C.Welzel (2013, for details, see: https://en.wikipedia.org/wiki/Freedom_Rising). Its main concept is called emancipative values, which are a combination of the freedom of choice and equality of opportunities. These values become especially important to people under certain social conditions, and they motivate further changes in societies by changing life strategies of individuals and putting greater emphasis on human agency.
For an entertaining overview, see: http://www.worldvaluessurvey.org/WVSContents.jsp?CMSID=Findings
RESEMAVAL is standardised; it ranges from 0 to 1 where “0” means the lowest possible level of emancipative values, and “1” is the highest possible level of emancipative values.Let’s explore the issue. First, get the data we need from the all-countries file: (you can download the data for free on their site)
library(foreign)
wvs <- read.spss("WV6_Data_spss_v_2016_01_01.sav", to.data.frame = T, use.value.labels = T)wvsrus <- subset(wvs, V2 == "Russia")
dim(wvsrus) # there are 2,500 observations of 430 variables## [1] 2500 430
wvsrus$RESEMAVAL1 <- as.numeric(as.character(wvsrus$RESEMAVAL)) # emancipative values is a continuous varlibrary(ggplot2)
library(scales)
ggplot(wvsrus, aes(x = RESEMAVAL1)) +
geom_density() +
xlim(0, 1) +
geom_vline(aes(xintercept=mean(RESEMAVAL1)),
color="blue", linetype="dashed", size=1)summary(wvsrus$RESEMAVAL1)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.2998 0.3917 0.3931 0.4850 0.9259 22
par(mar = c(5, 25, 2, 2))
barplot(table(wvsrus$V248)/nrow(wvsrus)*100, las = 2, horiz = T, xlim = c(0, 60))Groups as they are now are very different in numbers, which may affect the F-test.
wvsrus$educ <- rep(NA, length(wvsrus$V248)) # create a new, empty variable
wvsrus$educ[wvsrus$V248 == "No formal education" |
wvsrus$V248 == "Incomplete primary school" |
wvsrus$V248 == "Complete primary school"|
wvsrus$V248 == "Incomplete secondary school: university-preparatory type"|
wvsrus$V248 == "Incomplete secondary school: technical/ vocational type"] <- "primary" # merge all these levels as "primary"
wvsrus$educ[wvsrus$V248 == "Complete secondary school: university-preparatory type"] <- "secondary gen" # rename this level
wvsrus$educ[wvsrus$V248 == "Complete secondary school: technical/ vocational type"] <- "secondary vocational" # rename this level
wvsrus$educ[wvsrus$V248 == "Some university-level education, without degree" |
wvsrus$V248 == "University - level education, with degree" ] <- "tertiary" # merge these levels into "tertiary"
wvsrus$educ <- as.factor(wvsrus$educ)
library(dplyr)
wvsrus <- select(wvsrus, RESEMAVAL1, educ) # select the variables you need to work fasterlibrary(psych)
library(magrittr)
library(knitr)
library(kableExtra)
describeBy(wvsrus$RESEMAVAL1, wvsrus$educ, mat = TRUE) %>% #create dataframe
select(Education = group1, N=n, Mean=mean, SD=sd, Median=median, Min=min, Max=max,
Skew=skew, Kurtosis=kurtosis, st.error = se) %>%
kable(align=c("lrrrrrrrr"), digits=2, row.names = FALSE,
caption="Emancipative Values by Education Level") %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)| Education | N | Mean | SD | Median | Min | Max | Skew | Kurtosis | st.error |
|---|---|---|---|---|---|---|---|---|---|
| primary | 356 | 0.35 | 0.13 | 0.34 | 0.06 | 0.74 | 0.41 | -0.08 | 0.01 |
| secondary gen | 363 | 0.38 | 0.13 | 0.38 | 0.08 | 0.76 | 0.05 | -0.59 | 0.01 |
| secondary vocational | 973 | 0.39 | 0.13 | 0.39 | 0.00 | 0.93 | 0.07 | 0.04 | 0.00 |
| tertiary | 774 | 0.42 | 0.13 | 0.43 | 0.03 | 0.79 | 0.03 | -0.16 | 0.00 |
par(mar = c(3,10,0,3))
barplot(table(wvsrus$educ)/nrow(wvsrus)*100, horiz = T, xlim = c(0, 60), las = 2) # we get 4 groups of comparable size, which is good for the exampleThe groups are of comparable size now, and the continuous variable is normally distributed within the groups (this is good).
boxplot(wvsrus$RESEMAVAL1 ~ wvsrus$educ)ggplot(wvsrus, aes(x = educ, y = RESEMAVAL1)) +
geom_boxplot() +
theme_classic()From this boxplot, we see that the Y variables is distributed rather normally in across the education groups and that the level of emancipative values is slightly higher among the better educated respondents.
ANOVA is a parametric test that rests on three assumptions:
If those assumptions are not met by the data, the classic ANOVA test is not reliable.
Independence should be checked while collecting the data (for someone else’s data, see the description of the sampling procedure).
Homogeneity (=equality) of variances is not accounted for by default, so you are safe. If the F-test is significant and you want to compare groups, your choice of post hoc tests depends on whether variances are equal across groups.
There’s a rule of thumb that the ANOVA is robust to heterogeneity of variance so long as the largest variance is not more than 4 times the smallest variance (see: link)
Normality of residuals (have a look at the distribution of what is left unexplained when you have accounted for group means).
Homogeneity of variance is easier to check before the analysis, while residuals are easier to check after the test.
library(car)
leveneTest(wvsrus$RESEMAVAL1 ~ wvsrus$educ)## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.4393 0.7249
## 2462
P-value is equal to 0.72, which means that variances are equal, i.e. you can indicate in the ANOVA test that var.equal = T.
oneway.test(wvsrus$RESEMAVAL1 ~ wvsrus$educ, var.equal = T) ##
## One-way analysis of means
##
## data: wvsrus$RESEMAVAL1 and wvsrus$educ
## F = 31.738, num df = 3, denom df = 2462, p-value < 2.2e-16
aov.out <- aov(wvsrus$RESEMAVAL1 ~ wvsrus$educ) # another function of ANOVA which should be used here; implies that var.equal = T
summary(aov.out) # df1 = k-1, df2 = N - k## Df Sum Sq Mean Sq F value Pr(>F)
## wvsrus$educ 3 1.58 0.5258 31.74 <2e-16 ***
## Residuals 2462 40.79 0.0166
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 34 observations deleted due to missingness
F(3, 2462) = 31.738, p-value < .001 means that the differences in the level of emancipative values across education groups are not equal.
plot(aov.out, 2)layout(matrix(1:4, 2, 2))
plot(aov.out)In normally distributed residuals, you will see a straight red line in the two upper graphs, and a straight line in the Q-Q plot.
anova.res <- residuals(object = aov.out)
describe(anova.res) # look at skew and kurtosis, should be < 2## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 2466 0 0.13 0 0 0.14 -0.4 0.53 0.93 0.1 -0.12
## se
## X1 0
shapiro.test(x = anova.res) # if the p-value is > .05, the distribution IS normal.##
## Shapiro-Wilk normality test
##
## data: anova.res
## W = 0.99824, p-value = 0.008871
hist(anova.res)Here, skew and kurtosis are OK for a normal distribution. However, the Shapiro-Wilk test says that it is not normal. Visual analysis and skew and kurtosis agree here, which is why we can conclude that the normality assumption holds.
To inspect which groups contribute to the statistical significance of this test, run post hoc (“after analysis”) tests.
Here, you have two options:
if variances are equal across groups, use Tukey’s post hoc test;
if variances are unequal, use pairwise t-test with Bonferroni’s correction.
plot(TukeyHSD(aov.out))#If the previous line does not work, try it step by step:
#Tukey <- TukeyHSD(aov.out)
#Tukey
#plot(Tukey)margins and change the direction of labelspar(mar = c(5, 15, 3, 1))
Tukey <- TukeyHSD(aov.out)
plot(Tukey, las = 2)pairwise.t.test(wvsrus$RESEMAVAL1, wvsrus$educ,
adjust = "bonferroni") # for unequal variances, pool.sd = T##
## Pairwise comparisons using t tests with pooled SD
##
## data: wvsrus$RESEMAVAL1 and wvsrus$educ
##
## primary secondary gen secondary vocational
## secondary gen 0.00087 - -
## secondary vocational 2.0e-08 0.09434 -
## tertiary < 2e-16 3.6e-07 2.4e-06
##
## P value adjustment method: holm
We see that all but one pairs have a statistically significant difference between the means. If you have trouble reading such p-values, use the options parameter:
options(scipen = 999)
pairwise.t.test(wvsrus$RESEMAVAL1, wvsrus$educ,
adjust = "bonferroni") ##
## Pairwise comparisons using t tests with pooled SD
##
## data: wvsrus$RESEMAVAL1 and wvsrus$educ
##
## primary secondary gen
## secondary gen 0.00087 -
## secondary vocational 0.00000002 0.09434
## tertiary < 0.0000000000000002 0.00000036
## secondary vocational
## secondary gen -
## secondary vocational -
## tertiary 0.00000244
##
## P value adjustment method: holm
The results are the same.
kruskal.test(RESEMAVAL1 ~ educ, data = wvsrus)##
## Kruskal-Wallis rank sum test
##
## data: RESEMAVAL1 by educ
## Kruskal-Wallis chi-squared = 92.577, df = 3, p-value <
## 0.00000000000000022
With KW chi-square(3) = 92.6, p-value is < .001, which means that the mean ranks of the education groups are not the same. This results confirms what we saw earlier in the ANOVA test.
If Kruskal-Wallis test is significant, you can run non-parametric post hoc tests, e.g. Dunn’s test:
library(DescTools)
DunnTest(RESEMAVAL1 ~ educ, data = wvsrus)##
## Dunn's test of multiple comparisons using rank sums : holm
##
## mean.rank.diff pval
## secondary gen-primary 196.7600 0.00042 ***
## secondary vocational-primary 266.8025 0.0000000073 ***
## tertiary-primary 426.8090 < 0.0000000000000002 ***
## secondary vocational-secondary gen 70.0425 0.10971
## tertiary-secondary gen 230.0490 0.0000015181 ***
## tertiary-secondary vocational 160.0065 0.0000092217 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These results show that all but one pair of groups have statistically significant differences in their medians.
See: https://stats.stackexchange.com/questions/126686/how-to-read-the-results-of-dunns-test
This also confirms what we saw above for the ANOVA post hoc tests.
Now that you have established the statistical significance, you may think, what is the substantive significance of group membership in determining the level of emancipative values.
To answer this question, look at the share of variance explained by education groups in all the variance of emancipative values. This is called the ‘effect size’ in ANOVA. The larger it is, the more important your finding. Any effect below .01 is considered small, .06 is medium, and .15 is large (see Field, Chapter 16).
omega_sq <- function(aov.out){ # this is a function for the effect size. It is based on the output of 'aov' above
sum_stats <- summary(aov.out)[[1]]
SSm <- sum_stats[["Sum Sq"]][1]
SSr <- sum_stats[["Sum Sq"]][2]
DFm <- sum_stats[["Df"]][1]
MSr <- sum_stats[["Mean Sq"]][2]
W2 <- (SSm-DFm*MSr)/(SSm+SSr+MSr)
return(W2)
}
omega_sq(aov.out) # replace the 'aov.out' with the name of your ANOVA resulting object## [1] 0.03604635
The result is .04 - this is low to medium effect. It means that, even though there is a statistically significant difference in level of emancipative values across education groups and this difference is statistically significant across all but one pairs of groups, in practical terms this effect is not big at all.
Examples of ANOVA scripts (read through them and take the best solutions):