Project 3. ANOVA (one-way analysis of variance in R)

Kulikov Artyom, Vlasenko Anastasia, Artyushin Alexey and Bykova Nadezhda

15.03.2019

Team: Rteam&DA

Areas of members’ responsibilities:

While the country of our interest is still Germany, in this particular project we use data from two topics, namely Social Demographics and Media and Social Trust, in order to investigate the connection between them. The data was collected in 2016.

Here we load the data and drop the observations with NAs in our variable of interest, which is the highest level of education obtained by a person according to the International Standard Classification of Education (ISCED).

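library(foreign)  # provides read.spss()
# Load the ESS round 8 file for Germany as a data frame with labelled factors
data1 <- read.spss("C:/Users/Anastasia/Downloads/ESS8DE.sav", 
                   use.value.labels = TRUE, to.data.frame = TRUE)
# Keep only respondents with a recorded education level
data1 <- data1[!is.na(data1$eisced),]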

Variables

While there is also a continuous variable for total years of education (eduyrs), we have chosen the categorical variable (eisced), which indicates the highest level of education obtained by a person according to the International Standard Classification of Education (ISCED). According to the ESS website, this variable serves as the best indicator of education.

Let’s have a look at how this variable is structured.

##             Not possible to harmonise into ES-ISCED 
##                                                   0 
##              ES-ISCED I , less than lower secondary 
##                                                  74 
##                        ES-ISCED II, lower secondary 
##                                                 262 
##           ES-ISCED IIIb, lower tier upper secondary 
##                                                1050 
##           ES-ISCED IIIa, upper tier upper secondary 
##                                                 123 
##        ES-ISCED IV, advanced vocational, sub-degree 
##                                                 597 
##     ES-ISCED V1, lower tertiary education, BA level 
##                                                 288 
## ES-ISCED V2, higher tertiary education, >= MA level 
##                                                 440 
##                                               Other 
##                                                   0

Now, let’s have a look at the bar chart representing how the observations are distributed in groups.

The ESS website itself suggests recoding this variable into a new variable with three levels: primary (1, 2), secondary (3, 4) and tertiary (5, 6, 7). We followed this suggestion but used four levels instead of three. The first two categories (less than lower secondary, lower secondary) were assigned to the “Low Secondary” level, and the next two (lower tier upper secondary, upper tier upper secondary) to the “Upper Secondary” level. The advanced vocational category was kept as it is, and a new “Tertiary” level was created for the last two categories (lower tertiary education, higher tertiary education).
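The recoding described above can be sketched in base R as follows. This is a minimal illustration and not the code used for this report: the short level names (`I`, `II`, and so on) are stand-ins for the full ES-ISCED labels, the vector is rebuilt from the counts printed in the frequency table, and the object names `eisced_toy` and `educ_toy` are hypothetical.

```r
# Stand-in labels for the seven non-empty ES-ISCED levels (assumption: short
# names replace the full labels shown in the frequency table)
lv <- c("I", "II", "IIIb", "IIIa", "IV", "V1", "V2")
# Counts copied from the frequency table in this report
counts <- c(74, 262, 1050, 123, 597, 288, 440)

eisced_toy <- factor(rep(lv, times = counts), levels = lv)

# Collapse seven levels into four: assigning a named list to levels()
# merges the old levels listed under each new name
educ_toy <- eisced_toy
levels(educ_toy) <- list(
  "Low Secondary"       = c("I", "II"),
  "Upper Secondary"     = c("IIIb", "IIIa"),
  "Advanced vocational" = "IV",
  "Tertiary"            = c("V1", "V2")
)

group_sizes <- table(educ_toy)
print(group_sizes)
```

The resulting group sizes (336, 1173, 597 and 728) agree with the N column of the descriptive table later in the report.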

We can now look at the way our new variable is distributed.

Variables (continued)

As seen from the barplot, the largest group has upper secondary education; its share of the sample is a little more than 40%. Low secondary education, on the contrary, is the least widespread in the German sample. Tertiary education has been attained by about a quarter of respondents, and people with advanced vocational education make up about one fifth.
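The shares quoted above can be checked from the group sizes alone. The sketch below takes the four counts from the N column of the descriptive table in this report; the vector name `n_group` is only illustrative.

```r
# Group sizes as reported in the descriptive table of this report
n_group <- c("Advanced vocational" = 597,
             "Low Secondary"       = 336,
             "Tertiary"            = 728,
             "Upper Secondary"     = 1173)

# Share of each education group in the analysed sample, in percent
shares <- round(100 * n_group / sum(n_group), 1)
print(shares)
```

This gives roughly 21%, 12%, 26% and 41%, in line with the description of the barplot.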

The second variable (nwspol) of our interest is connected with the topic of political news. This variable represents the time each respondent spends everyday reading, watching or listening to news connected with politics.

And now let’s visualize this variable as a density plot and as a histogram.

library(ggplot2)
# Refer to columns by name inside aes(); the data1$ prefix is not needed there
ggplot(data1, aes(x = nwspo)) +
  geom_histogram(fill = "#DDA0DD") +
  labs(title = "Distribution of time spent on political news", x = "Minutes", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nwspo, na.rm = TRUE), colour = "Mean"), lwd = 1.1) +
  labs(colour = "")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram shows that the distribution does not look normal. It is right-skewed, with a longer tail towards large values. There is also no single clear tendency: tall and short bars alternate with each other. The density plot is clearly double-humped. The mean line sits at approximately 22.5 minutes, and the mode, the most frequent answer, lies somewhere around 23-25 minutes spent on political news.

Since we want to work with these variables together, we need to look at the descriptives of the continuous variable by groups of the categorical one.

Time spent watching/reading political news by Education Level

Education              N     Mean   SD     Median  Min  Max  Skew  Kurtosis  St.error
Advanced vocational    597   22.76  10.98  26      1    69   0.51  0.31      0.45
Low Secondary          336   18.79  12.10  15      1    64   0.81  0.43      0.66
Tertiary               728   23.91  11.03  26      1    73   0.72  1.19      0.41
Upper Secondary        1173  22.57  11.38  26      1    74   0.54  0.57      0.33
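
Two columns of this table can be cross-checked with a few lines of R: the standard error is the group SD divided by the square root of the group size, and the size-weighted average of the group means gives the overall mean. The sketch below uses only the values printed in the table; the object names are illustrative.

```r
# Values copied from the descriptive table, in the table's row order:
# Advanced vocational, Low Secondary, Tertiary, Upper Secondary
n  <- c(597, 336, 728, 1173)
m  <- c(22.76, 18.79, 23.91, 22.57)
sd <- c(10.98, 12.10, 11.03, 11.38)

# Standard error of each group mean
se <- round(sd / sqrt(n), 2)

# Overall mean as the size-weighted average of the group means
grand_mean <- sum(n * m) / sum(n)

print(se)
print(round(grand_mean, 1))
```

The standard errors reproduce the St.error column, and the weighted mean comes out at about 22.5 minutes, the value marked by the mean line on the histogram.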

We hypothesize that the education of a respondent and the time he/she spends watching political news are not independent. For this particular project, we want to see whether the division into education groups is useful for explaining the variance in that time.

Let’s create a boxplot to see how these two variables look together.

Assumptions. Homogeneity of variances and F-ratio, independence of observations

ANOVA is a parametric test, so before running it we have to check its assumptions: independence of observations, homogeneity of variances across groups, and normality of residuals.

As for the first assumption, we have carefully looked through our data and its description on the European Social Survey website to make sure that the observations we chose are independent.

In order to check whether the variances of our groups are equal, we now move on to Levene’s test for homogeneity of variances.

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value  Pr(>F)  
## group    3  2.3949 0.06647 .
##       2830                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is 0.066, which is above the 0.05 threshold. In other words, if H0 (equal variances across groups) were true, a test statistic at least this extreme would occur with a probability of about 0.066. We therefore cannot reject the null hypothesis, and we proceed treating the group variances as equal.
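
The output header suggests `leveneTest()` from the car package, although the call itself is not shown in this report. As a hedged illustration of what the median-centred version of the test does, the sketch below runs a one-way ANOVA on absolute deviations from the group medians, using simulated data (the data frame `toy` and its columns are hypothetical), and then recomputes the reported p-value from the reported F statistic and degrees of freedom.

```r
set.seed(1)
# Simulated stand-in data: four groups of 50 with equal variances (assumption)
toy <- data.frame(
  educ    = factor(rep(c("A", "B", "C", "D"), each = 50)),
  minutes = rnorm(200, mean = 22, sd = 11)
)

# Median-centred Levene test: ANOVA on absolute deviations
# of each observation from its own group median
toy$absdev <- abs(toy$minutes - ave(toy$minutes, toy$educ, FUN = median))
levene_tab <- anova(lm(absdev ~ educ, data = toy))
print(levene_tab)

# p-value implied by the F statistic reported in this document
p_reported <- pf(2.3949, df1 = 3, df2 = 2830, lower.tail = FALSE)
print(p_reported)
```

The second part does not need the raw data: the upper tail of the F distribution with 3 and 2830 degrees of freedom at 2.3949 is the Pr(>F) value in the output above.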

F-test

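We now turn to the actual F-test. We use it to find out whether there is an association between education level and the time spent on political news.

These are the hypotheses for the test: H0 states that the mean time spent on political news is the same in all four education groups; H1 states that at least one group mean differs from the others.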

## 
##  One-way analysis of means
## 
## data:  data1$nwspo and data1$educ
## F = 16.013, num df = 3, denom df = 2830, p-value = 2.536e-10
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## data1$educ     3   6131  2043.6   16.01 2.54e-10 ***
## Residuals   2830 361166   127.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With F(3, 2830) = 16.01 and p < .001, the null hypothesis should be rejected: the difference in the time spent on political news across education groups is statistically significant.
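
To make the F-ratio less of a black box, it can be approximately rebuilt from the group sizes, means and standard deviations in the descriptive table: the between-group sum of squares comes from the spread of group means around the overall mean, and the within-group sum of squares from the group variances. This is a sketch under the assumption that the rounded table values are accurate enough; it is not the computation used in the report, and the object names are illustrative.

```r
# Summary statistics copied from the descriptive table (rounded values)
n  <- c(597, 336, 728, 1173)
m  <- c(22.76, 18.79, 23.91, 22.57)
sd <- c(10.98, 12.10, 11.03, 11.38)

k          <- length(n)
grand_mean <- sum(n * m) / sum(n)

ss_between <- sum(n * (m - grand_mean)^2)   # variation of group means
ss_within  <- sum((n - 1) * sd^2)           # variation inside groups

df_between <- k - 1
df_within  <- sum(n) - k

f_ratio <- (ss_between / df_between) / (ss_within / df_within)
print(c(F = f_ratio, df1 = df_between, df2 = df_within))

# p-value implied by the F statistic reported in the output above
p_reported <- pf(16.013, df1 = 3, df2 = 2830, lower.tail = FALSE)
print(p_reported)
```

Because the inputs are rounded to two decimals, the rebuilt F is close to, but not exactly, the reported 16.01.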

Normality of residuals

Now we move on to check the third assumption: whether the residuals are normally distributed. We create a Q-Q plot in order to do this.

We can see an almost straight line in the Q-Q plot and almost straight red lines in the upper two graphs. Therefore we can conclude that our residuals are approximately normally distributed.

But we are willing to check it again with a more formal procedure.

##    vars    n mean    sd median trimmed   mad    min   max range skew
## X1    1 2834    0 11.29   2.09   -0.63 14.31 -22.91 51.43 74.35 0.62
##    kurtosis   se
## X1     0.67 0.21

Here we see that skew is 0.62 and kurtosis is 0.67. Both are below 2, the rule-of-thumb threshold we use for an acceptable departure from normality.

We are now moving on to the Shapiro-Wilk normality test.

## 
##  Shapiro-Wilk normality test
## 
## data:  anova.res
## W = 0.96145, p-value < 2.2e-16

Unfortunately, the p-value is extremely small (< 0.05), which indicates that the residuals are not distributed normally. But let’s take a look at the histogram.

Overall, skew and kurtosis were OK and the histogram looks at least somewhat appropriate, even though the Shapiro-Wilk test indicated that our residuals were not distributed normally. With a sample of 2834 observations, this test detects even small departures from normality, so its rejection alone is not decisive. As a conclusion of the visual analysis, we claim that the normality assumption holds.
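
The tension between a rejecting Shapiro-Wilk test and acceptable skew and kurtosis can be reproduced on simulated data. The sketch below is an illustration only: it draws mildly right-skewed values from a gamma distribution (an assumption chosen so that the theoretical skewness, 2 / sqrt(11), is about 0.6), with the same number of observations as our residuals, and computes skewness and excess kurtosis from central moments.

```r
set.seed(42)
# Simulated, mildly right-skewed "residuals", centred at zero (assumption:
# gamma with shape 11; not the residuals from this report)
n_obs   <- 2834
x       <- rgamma(n_obs, shape = 11, rate = 1)
res_toy <- x - mean(x)

# Sample skewness and excess kurtosis from central moments
m2   <- mean(res_toy^2)
skew <- mean(res_toy^3) / m2^1.5
kurt <- mean(res_toy^4) / m2^2 - 3

# Shapiro-Wilk test on the same values (valid for 3 to 5000 observations)
sw <- shapiro.test(res_toy)

print(round(c(skew = skew, kurtosis = kurt), 2))
print(sw$p.value)
```

With this many observations the test rejects normality even though skew and kurtosis stay well below 2, which is the same pattern we see in our residuals.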

Post hoc test

We have already established that the F-ratio is statistically significant, and now we want to find out which specific pairs of group means are statistically different. For this purpose we move on to Tukey’s Honestly Significant Difference (HSD) post hoc test. We chose it over the pairwise t-test with Bonferroni’s correction since Levene’s test gave us no reason to reject the equality of group variances.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = data1$nwspo ~ data1$educ)
## 
## $`data1$educ`
##                                           diff        lwr         upr
## Low Secondary-Advanced vocational   -3.9701035 -5.9505885 -1.98961846
## Tertiary-Advanced vocational         1.1560412 -0.4473616  2.75944397
## Upper Secondary-Advanced vocational -0.1901665 -1.6501166  1.26978357
## Tertiary-Low Secondary               5.1261447  3.2109054  7.04138394
## Upper Secondary-Low Secondary        3.7799370  1.9830799  5.57679400
## Upper Secondary-Tertiary            -1.3462077 -2.7163444  0.02392902
##                                         p adj
## Low Secondary-Advanced vocational   0.0000016
## Tertiary-Advanced vocational        0.2486925
## Upper Secondary-Advanced vocational 0.9870671
## Tertiary-Low Secondary              0.0000000
## Upper Secondary-Low Secondary       0.0000004
## Upper Secondary-Tertiary            0.0562520

By looking at the adjusted p-values for all possible pairs, we can see that the difference between group means is statistically significant for only three pairs out of six. The p-values for the comparisons between Tertiary and Advanced vocational (0.25) and between Upper Secondary and Advanced vocational (0.99) are large, while the p-value for the comparison between Upper Secondary and Tertiary (0.056) is just slightly bigger than 0.05. We can also make sure that everything is right by looking at the actual differences between group means: the comparisons with large p-values have very small differences in means, namely 1.1560412, -0.1901665 and -1.3462077.

We now proceed to plot these differences in means.

By looking at the plot, it is quite clear that two lines (Tertiary - Advanced vocational and Upper Secondary - Advanced vocational) cross the zero value, while it is unclear whether the third line (Upper Secondary - Tertiary) crosses it or not. However, we already know that the p-value reported for this pair, although quite small, is still bigger than 0.05, and the upper bound of its interval (0.024) is indeed above zero, so this line does cross the zero value. The remaining three lines do not cross the zero value, which means that the differences between these groups are statistically significant.
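
The borderline interval for Upper Secondary - Tertiary can be rebuilt by hand, which shows where the lines on the plot come from. The sketch below assumes the Tukey-Kramer form of the interval for unequal group sizes and uses only numbers printed in this report (the residual mean square from the ANOVA table, the group sizes and the estimated difference); the object names are illustrative.

```r
# Inputs taken from the ANOVA table and the Tukey output in this report
msw     <- 127.6        # residual mean square
df_res  <- 2830         # residual degrees of freedom
k       <- 4            # number of education groups
n_upper <- 1173         # Upper Secondary
n_tert  <- 728          # Tertiary
d       <- -1.3462077   # Upper Secondary minus Tertiary

# Standard error of the difference and the studentized-range critical value
se_diff <- sqrt(msw * (1 / n_upper + 1 / n_tert))
q_crit  <- qtukey(0.95, nmeans = k, df = df_res)

# Tukey-Kramer 95% family-wise interval
margin <- q_crit / sqrt(2) * se_diff
ci     <- c(lwr = d - margin, upr = d + margin)
print(round(ci, 3))
```

The interval runs from about -2.72 to just above zero, which matches the lwr and upr columns up to rounding of the mean square.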

Non-parametric equivalent of ANOVA

Now we want to double-check our results by performing the non-parametric equivalent of ANOVA, the Kruskal-Wallis test. Since it works on ranks, it is generally less powerful than ANOVA when the parametric assumptions hold, but we wanted to compare it with the results of ANOVA. This test does not require normality of residuals, so we can proceed directly to the test and its hypotheses.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  nwspo by educ
## Kruskal-Wallis chi-squared = 54.94, df = 3, p-value = 7.072e-12

As we can see, with KW chi-square(3) = 54.94, the p-value is < .001, which means that the null hypothesis is rejected and the differences between the mean ranks of the groups are statistically significant. This result confirms what we saw earlier in the ANOVA test.

Since the result of this test is significant, we would also like to run Dunn’s test. We also want to know the substantive significance of group membership in determining the time spent watching political news. Therefore, we look at the effect size: the share of the total variance in time spent watching/reading political news that is explained by education groups.
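
For readers who want to see the call shape, the sketch below runs `kruskal.test()` on simulated data (the data frame `toy` and its columns are hypothetical and unrelated to the ESS file) and then recomputes the reported p-value from the reported chi-squared statistic.

```r
set.seed(7)
# Simulated stand-in data: skewed minutes in four groups of 60 (assumption)
toy <- data.frame(
  educ    = factor(rep(c("Low", "Upper", "Vocational", "Tertiary"), each = 60)),
  minutes = rexp(240, rate = 1 / 20)
)

# Kruskal-Wallis rank sum test: compares groups on ranks, not raw values
kw <- kruskal.test(minutes ~ educ, data = toy)
print(kw)

# p-value implied by the statistic reported in this document:
# upper tail of a chi-squared distribution with k - 1 = 3 degrees of freedom
p_reported <- pchisq(54.94, df = 3, lower.tail = FALSE)
print(p_reported)
```

The degrees of freedom equal the number of groups minus one, which is why both the toy test and the reported test have df = 3.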

## [1] 0.01564355

We get a result of about 0.016, which indicates a small effect size (by the common benchmarks of 0.01 for small, 0.06 for medium and 0.14 for large). It means that, even though there is a statistically significant difference in time spent watching political news across education groups, and this difference is statistically significant for three pairs of groups, in practical terms the effect is not large at all.
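
The effect size can be recomputed from the ANOVA table alone. The report does not name the estimator; the sketch below is based on the inference that the printed value matches omega squared rather than eta squared, and it shows both for comparison. The inputs are the sums of squares and the residual mean square from the ANOVA output above; the object names are illustrative.

```r
# Values copied from the ANOVA table in this report
ss_between <- 6131
ss_within  <- 361166
df_between <- 3
ms_within  <- 127.6

ss_total <- ss_between + ss_within

# Eta squared: raw share of variance explained by education group
eta_sq <- ss_between / ss_total

# Omega squared: the same share with a correction for sampling bias
omega_sq <- (ss_between - df_between * ms_within) / (ss_total + ms_within)

print(round(c(eta_sq = eta_sq, omega_sq = omega_sq), 4))
```

Omega squared comes out at about 0.0156, the value printed above, while eta squared is slightly larger at about 0.0167; either way the effect is small.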

The end

Thank you for bearing with us.