Kulikov Artyom, Vlasenko Anastasia, Artyushin Alexey and Bykova Nadezhda
15.03.2019
Areas of members’ responsibilities:
While the country of our interest is still Germany, in this particular project we use data from two topics, namely Social Demographics and Media and Social Trust, in order to investigate the connection between them. The data was collected in 2016.
Here we load the data and drop the rows with missing values (NAs) in our variable of interest, the respondent's education level.
library(foreign)  # provides read.spss()
data1 <- read.spss("C:/Users/Anastasia/Downloads/ESS8DE.sav",
  use.value.labels = TRUE, to.data.frame = TRUE)
# keep only respondents with a non-missing education level
data1 <- data1[!is.na(data1$eisced), ]
While there is also a continuous variable for total years of education (eduyrs), we have chosen the categorical variable (eisced), which indicates the highest level of education obtained by a person according to the International Standard Classification of Education (ISCED). According to the ESS website, this variable serves as the best indicator of education.
Let’s have a look at how this variable is structured.
## Not possible to harmonise into ES-ISCED
## 0
## ES-ISCED I , less than lower secondary
## 74
## ES-ISCED II, lower secondary
## 262
## ES-ISCED IIIb, lower tier upper secondary
## 1050
## ES-ISCED IIIa, upper tier upper secondary
## 123
## ES-ISCED IV, advanced vocational, sub-degree
## 597
## ES-ISCED V1, lower tertiary education, BA level
## 288
## ES-ISCED V2, higher tertiary education, >= MA level
## 440
## Other
## 0
Now, let’s have a look at the bar chart representing how the observations are distributed in groups.
The ESS website itself suggests recoding this variable into a new variable with three levels: primary (1,2), secondary (3,4) and tertiary (5,6,7) education. We followed this suggestion but used four levels instead of three. The first two categories (less than lower secondary, lower secondary) were merged into the “Low Secondary” level, and the following two categories (lower tier upper secondary, upper tier upper secondary) were assigned to the “Upper Secondary” level. The advanced vocational category remained as it is, and a new level, “Tertiary”, was created for the last two categories (lower tertiary education, higher tertiary education). All of these changes were made in order to make the groups more comparable in size.
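The recoding described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the original chunk: the short level labels (`"I"`, `"II"`, ...) and the variable names `eisced_short` and `educ_sketch` are hypothetical, and the factor is rebuilt from the counts printed above so that the example runs without the ESS file.

```r
# Hypothetical short labels standing in for the seven observed ES-ISCED levels
isced_levels <- c("I", "II", "IIIb", "IIIa", "IV", "V1", "V2")
# Counts copied from the frequency table printed above
isced_counts <- c(74, 262, 1050, 123, 597, 288, 440)

# Rebuild a factor with one entry per respondent
eisced_short <- factor(rep(isced_levels, times = isced_counts),
                       levels = isced_levels)

# Collapse seven levels into four by assigning a named list to levels()
educ_sketch <- eisced_short
levels(educ_sketch) <- list(
  "Low Secondary"       = c("I", "II"),
  "Upper Secondary"     = c("IIIb", "IIIa"),
  "Advanced vocational" = "IV",
  "Tertiary"            = c("V1", "V2")
)

educ_counts <- table(educ_sketch)
educ_shares <- round(100 * prop.table(educ_counts), 1)  # shares in percent
print(educ_counts)
print(educ_shares)
```

Assigning a named list to `levels()` keeps the mapping in one place, which makes it easy to compare against the ESS recommendation.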
We can now look at the way our new variable is distributed.
As seen from the barplot, the largest group is people with upper secondary education; their share in the sample is a little more than 40%. Low secondary education, on the contrary, is the least widespread, at about 12%. Tertiary education is attained by about a quarter of the respondents, and people with advanced vocational education account for about one fifth.
The second variable of interest (nwspol) is connected with the topic of political news. It represents the time each respondent spends every day reading, watching or listening to news about politics (in minutes).
And now let’s visualize this variable as a density plot and as a histogram.
library(ggplot2)
# histogram of daily minutes spent on political news, with the mean marked
ggplot(data1, aes(x = nwspo)) +
  geom_histogram(fill = "#DDA0DD") +
  labs(title = "Distribution of time spent on political news", x = "Minutes", y = "Frequency") +
  geom_vline(aes(xintercept = mean(nwspo, na.rm = TRUE), colour = "Mean"), lwd = 1.1) +
  labs(colour = "")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram shows that the distribution does not look normal. It is right-skewed, which agrees with the positive skew values reported in the tables below. There is also no smooth trend: tall and short bars alternate with each other. The density plot is clearly double-humped. The mean is approximately 22.5 minutes. Most people chose an answer around 23-25 minutes, which is the mode.
Since we want to work with these variables together, we need to look at the descriptives of the continuous variable by groups of the categorical one.
| Education | N | Mean | SD | Median | Min | Max | Skew | Kurtosis | St.error |
|---|---|---|---|---|---|---|---|---|---|
| Advanced vocational | 597 | 22.76 | 10.98 | 26 | 1 | 69 | 0.51 | 0.31 | 0.45 |
| Low Secondary | 336 | 18.79 | 12.10 | 15 | 1 | 64 | 0.81 | 0.43 | 0.66 |
| Tertiary | 728 | 23.91 | 11.03 | 26 | 1 | 73 | 0.72 | 1.19 | 0.41 |
| Upper Secondary | 1173 | 22.57 | 11.38 | 26 | 1 | 74 | 0.54 | 0.57 | 0.33 |
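As a quick consistency check on the table above, the St.error column should equal SD divided by the square root of N. The sketch below recomputes it from the rounded values in the table; the data frame `desc` and its column names are introduced only for this illustration.

```r
# N, SD and reported standard error copied from the descriptives table
desc <- data.frame(
  education   = c("Advanced vocational", "Low Secondary",
                  "Tertiary", "Upper Secondary"),
  n           = c(597, 336, 728, 1173),
  sd          = c(10.98, 12.10, 11.03, 11.38),
  se_reported = c(0.45, 0.66, 0.41, 0.33)
)

# Standard error of the mean: SD / sqrt(N), rounded like the table
desc$se_check <- round(desc$sd / sqrt(desc$n), 2)
print(desc)
```

The recomputed values agree with the reported ones, and they show why the Low Secondary mean is the least precisely estimated: it has the smallest N and the largest SD.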
We hypothesize that education of a respondent and the time he/she spends watching political news are not independent variables. But for this particular project we want to see if the division into groups would be useful in the explanation of variance.
mean(data1$nwspo)
## [1] 22.50635
As we can see from the aggregated data, the differences between the group means and the mean for the whole dataset are relatively small.
If we were to judge solely by the table, it would be quite hard to draw any conclusions: the group means (22.76, 18.79, 23.91 and 22.57) are quite similar to each other, with only the Low Secondary group standing out. Whatever size of F-ratio we might expect from this, we need to run the formal test in order to check whether there is, in fact, a statistically significant difference in group means.
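The relation between the group means and the overall mean can be made explicit: the overall mean is the N-weighted average of the group means. The sketch below uses only the rounded values from the descriptives table, so it reproduces the reported 22.50635 only approximately; the object names are introduced for this illustration.

```r
# Group sizes and means copied from the descriptives table
n <- c("Advanced vocational" = 597, "Low Secondary" = 336,
       "Tertiary" = 728, "Upper Secondary" = 1173)
group_mean <- c(22.76, 18.79, 23.91, 22.57)
names(group_mean) <- names(n)

# Overall mean as the weighted average of group means
grand_mean <- sum(n * group_mean) / sum(n)

# Distance of each group mean from the overall mean, in minutes
deviation <- round(group_mean - grand_mean, 2)
print(grand_mean)
print(deviation)
```

Three of the four groups lie within about a minute and a half of the overall mean, while Low Secondary lies almost four minutes below it.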
Let’s create a boxplot to see how these two variables look together.
Three out of four groups (advanced vocational, tertiary and upper secondary education) look quite similar, with about the same central values; however, outliers in these groups are distributed differently. The low secondary group looks different from the other three, with a lower central value. So, according to this boxplot, it is possible to assume that having low secondary education is associated with spending less time on reading or watching political news, while the other three levels of education do not seem to differ much from each other and all show more minutes spent on watching or reading about politics.
Initially, our choice of variables was driven by the assumption that people who are more educated tend to watch and read more about politics. First of all, they have more free time in comparison with people with a low educational level, who have to start working as early as possible. Secondly, more educated people have a better understanding of what is going on in the political sphere of life and are more interested in it, partly because of their understanding and partly because of the feeling that they can actually contribute to it somehow.
According to a survey reported in an article published in “The Conversation” in 2017, people in unstable jobs show less interest in politics than people in economically secure positions and are more likely to be undecided voters or to not vote at all. Nearly 40% of people in economically insecure positions claimed that they cannot change anything with their vote. Instead of trying to rebel against the existing situation, these people just turn their backs on politics. Our data was collected in 2016 and the article was published in 2017, both before the federal election in 2017.
We can hypothesize that, from a sociological point of view, these two ideas are connected: people with lower education tend to get unstable jobs and turn their backs on politics, both because they feel like their votes don’t count and because they have little knowledge of how the system works. It is the other way around for people with higher education.
ANOVA is a parametric test, so before running it we have to check its assumptions: (1) the observations are independent, (2) the variances of the groups are homogeneous, and (3) the residuals are normally distributed.
The team working on the European Social Survey is aware of the importance of data quality, which is why they monitor it using different measures such as the Multitrait-Multimethod (MTMM) approach (to assess reliability, validity and method effects of the questions) and the Survey Quality Predictor (SQP). They additionally check measurement equivalence and the measurement quality of concepts, and publish quality reports where the methodologies are described.
We needed to check the way the data was collected because of the first assumption. We have carefully looked through our data and its description on the European Social Survey website to make sure that the observations we chose are independent.
In order to check whether the variances of the groups are equal, we now move on to Levene’s test for homogeneity of variances.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 2.3949 0.06647 .
## 2830
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is 0.066, which is above the 0.05 threshold, we fail to reject the null hypothesis of equal variances: if H0 were true for the population, a test statistic at least as extreme as ours would occur with a probability of about 0.07. We therefore find no evidence that the group variances differ and treat the homogeneity assumption as met.
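To show what the test with center = median does internally, here is a minimal base-R sketch on simulated data. It is an illustration only: the group labels, sizes and the normal distribution are hypothetical, and the original analysis presumably used a packaged Levene test rather than this hand-rolled version. The idea is to run a one-way ANOVA on the absolute deviations of each observation from its group median.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): four groups, equal true variance
groups  <- factor(rep(c("A", "B", "C", "D"), times = c(60, 40, 70, 110)))
minutes <- rnorm(length(groups), mean = 22, sd = 11)

# Absolute deviation of every observation from its own group median
abs_dev <- abs(minutes - ave(minutes, groups, FUN = median))

# Median-centred Levene test = one-way ANOVA on those deviations
levene_fit <- anova(lm(abs_dev ~ groups))
print(levene_fit)
```

A large p-value in the `groups` row means that the typical spread around the median is similar across groups, which is what the homogeneity assumption requires.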
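We now move on to the actual F-test, which we use to find out whether time spent on political news is associated with education group.
These are the hypotheses for the test: H0 states that the mean time spent on political news is the same in all four education groups; H1 states that at least one group mean differs from the others.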
##
## One-way analysis of means
##
## data: data1$nwspo and data1$educ
## F = 16.013, num df = 3, denom df = 2830, p-value = 2.536e-10
## Df Sum Sq Mean Sq F value Pr(>F)
## data1$educ 3 6131 2043.6 16.01 2.54e-10 ***
## Residuals 2830 361166 127.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F(3, 2830) = 16.01, p < .001 means that the null hypothesis should be rejected: if H0 were true for the population, group differences at least as large as the ones we observe would be very unlikely. Thus, the difference in the time spent on political news across education groups is statistically significant.
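The F-ratio can be approximately reconstructed from the descriptives table alone, which makes it clearer where the number comes from. The sketch below is a worked check under the assumption that the rounded N, mean and SD values from that table are accurate enough; because of the rounding it lands close to, but not exactly on, the reported 16.01. All object names are introduced for this illustration.

```r
# N, mean and SD per group, copied from the descriptives table
n  <- c(597, 336, 728, 1173)
m  <- c(22.76, 18.79, 23.91, 22.57)
sd <- c(10.98, 12.10, 11.03, 11.38)

grand_mean <- sum(n * m) / sum(n)

# Between-group and within-group sums of squares
ss_between <- sum(n * (m - grand_mean)^2)
ss_within  <- sum((n - 1) * sd^2)

df_between <- length(n) - 1        # 3
df_within  <- sum(n) - length(n)   # 2830

# F-ratio: between-group mean square over within-group mean square
f_ratio <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_ratio, df_between, df_within, lower.tail = FALSE)

print(c(ss_between = ss_between, ss_within = ss_within, F = f_ratio))
print(p_value)
```

The between-group sum of squares is small relative to the within-group one, yet the F-ratio is large because the within-group mean square is divided by 2830 degrees of freedom. This is why a modest difference in means can still be highly significant in a sample of this size.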
Now we check the third assumption, whether the residuals are normally distributed, graphically by using a normal Q-Q plot.
The data points lie close to the diagonal line in the Q-Q plot, and the red lines in the upper two diagnostic graphs are almost straight. Therefore we can conclude that our residuals are approximately normally distributed.
But we would like to check it again with a more formal procedure.
## vars n mean sd median trimmed mad min max range skew
## X1 1 2834 0 11.29 2.09 -0.63 14.31 -22.91 51.43 74.35 0.62
## kurtosis se
## X1 0.67 0.21
The absolute values of skew and kurtosis should be below 2; here they are 0.62 and 0.67 respectively, so both are within the acceptable range.
We are now moving on to the Shapiro-Wilk normality test.
##
## Shapiro-Wilk normality test
##
## data: anova.res
## W = 0.96145, p-value < 2.2e-16
The Shapiro-Wilk test checks the null hypothesis that the residuals come from a normal distribution. Unfortunately, the p-value is extremely small (< 0.05), and as a rule, if the p-value is less than or equal to 0.05, we should conclude that the residuals did not come from a normal distribution. However, with 2834 observations this test detects even minor deviations from normality, so let’s also take a look at the histogram.
Overall, skew and kurtosis were OK and the histogram looks at least somewhat appropriate, even though the Shapiro-Wilk test indicated that our residuals were not normally distributed. Based on the visual analysis and the skew and kurtosis values, we treat the normality assumption as approximately met.
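The tension between acceptable skew and kurtosis and a rejected Shapiro-Wilk test can be illustrated on simulated data. The sketch below is an assumption-laden illustration, not a reanalysis: it draws 2834 values from a gamma distribution chosen only because its theoretical skew and excess kurtosis (about 0.63 and 0.6) resemble the values reported above, and it uses simple moment-based formulas that may differ slightly from the estimator behind the printed output.

```r
set.seed(2016)

# Simulated stand-in for the residuals (hypothetical distribution)
n_obs <- 2834
x     <- rgamma(n_obs, shape = 10, rate = 1)
res   <- x - mean(x)   # centre so the values behave like residuals

# Moment-based skewness and excess kurtosis
m2 <- mean(res^2)
m3 <- mean(res^3)
m4 <- mean(res^4)
skewness        <- m3 / m2^1.5
excess_kurtosis <- m4 / m2^2 - 3

# Formal test of normality on the same values
sw <- shapiro.test(res)

print(c(skew = skewness, kurtosis = excess_kurtosis))
print(sw)
```

In a sample of this size, a mild and practically harmless asymmetry is enough for the formal test to reject normality, which is why the rule-of-thumb thresholds and the plots carry more weight in the decision here.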
We have already established that the F-ratio is statistically significant, and now we want to find out which specific pairs of group means are statistically different. For this purpose we move on to Tukey’s Honestly Significant Difference (HSD) post hoc test. We chose it over the pairwise t-test with Bonferroni’s correction since we have already established that the variances of the groups are equal.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data1$nwspo ~ data1$educ)
##
## $`data1$educ`
## diff lwr upr
## Low Secondary-Advanced vocational -3.9701035 -5.9505885 -1.98961846
## Tertiary-Advanced vocational 1.1560412 -0.4473616 2.75944397
## Upper Secondary-Advanced vocational -0.1901665 -1.6501166 1.26978357
## Tertiary-Low Secondary 5.1261447 3.2109054 7.04138394
## Upper Secondary-Low Secondary 3.7799370 1.9830799 5.57679400
## Upper Secondary-Tertiary -1.3462077 -2.7163444 0.02392902
## p adj
## Low Secondary-Advanced vocational 0.0000016
## Tertiary-Advanced vocational 0.2486925
## Upper Secondary-Advanced vocational 0.9870671
## Tertiary-Low Secondary 0.0000000
## Upper Secondary-Low Secondary 0.0000004
## Upper Secondary-Tertiary 0.0562520
By looking at the adjusted p-values for all possible pairs, we can see that the difference between group means is statistically significant for only three pairs out of six, all of which involve the Low Secondary group. The p-values for the comparisons between Tertiary and Advanced vocational (0.25) and between Upper Secondary and Advanced vocational (0.99) are large, while the p-value for the comparison between Upper Secondary and Tertiary (0.056) is just slightly bigger than 0.05. We can also confirm this by looking at the actual differences between group means: the comparisons with large p-values have small differences in means, namely 1.1560412, -0.1901665 and -1.3462077.
We now proceed to plot these differences in means.
By looking at the plot, it is quite clear that two intervals (Tertiary - Advanced vocational and Upper Secondary - Advanced vocational) cross zero, while it is hard to see whether the third one (Upper Secondary - Tertiary) crosses it or not. However, we already know that the adjusted p-value for this pair, although fairly small, is still bigger than 0.05, and the upper bound of its interval (0.024) is indeed just above zero. The remaining three intervals do not cross zero, which means that the differences between these groups are statistically significant.
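The link between "the interval crosses zero" and "the adjusted p-value is above 0.05" can be checked directly on the printed Tukey output. The sketch below copies the bounds and p-values from that output; the data frame `tukey` and its column names are introduced only for this illustration.

```r
# Bounds and adjusted p-values copied from the Tukey HSD output above
tukey <- data.frame(
  pair  = c("Low Secondary-Advanced vocational",
            "Tertiary-Advanced vocational",
            "Upper Secondary-Advanced vocational",
            "Tertiary-Low Secondary",
            "Upper Secondary-Low Secondary",
            "Upper Secondary-Tertiary"),
  lwr   = c(-5.9505885, -0.4473616, -1.6501166,
             3.2109054,  1.9830799, -2.7163444),
  upr   = c(-1.98961846, 2.75944397, 1.26978357,
             7.04138394, 5.57679400, 0.02392902),
  p_adj = c(0.0000016, 0.2486925, 0.9870671,
            0.0000000, 0.0000004, 0.0562520)
)

# A 95% interval that contains zero corresponds to p_adj > 0.05
tukey$crosses_zero <- tukey$lwr < 0 & tukey$upr > 0
tukey$significant  <- tukey$p_adj < 0.05
print(tukey[, c("pair", "crosses_zero", "significant")])
```

The two columns are exact opposites for every pair, including the borderline Upper Secondary-Tertiary comparison, whose interval only just reaches past zero.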
Now we want to double check our results by performing the non-parametric equivalent of ANOVA, the Kruskal-Wallis test. This rank-based test is generally less powerful than ANOVA when ANOVA's assumptions hold, but we wanted to compare its results with those of the actual ANOVA. In this case we do not have to check the normality assumption and can just proceed to the test and its hypotheses.
##
## Kruskal-Wallis rank sum test
##
## data: nwspo by educ
## Kruskal-Wallis chi-squared = 54.94, df = 3, p-value = 7.072e-12
As we can see, with KW chi-square(3) = 54.94, the p-value is < .001, which means that the null hypothesis is rejected and the differences between the mean ranks of the groups turn out to be statistically significant. This result confirms what we saw earlier in the ANOVA test.
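A minimal sketch of how such a test is run and what "mean ranks" refers to is given below. It uses simulated data: the group sizes are taken from the descriptives table, but the values themselves, the gamma distribution and the object names are hypothetical, so the printed statistic will not match the result above.

```r
set.seed(8)

# Group sizes from the descriptives table; the values are simulated
group_sizes <- c("Advanced vocational" = 597, "Low Secondary" = 336,
                 "Tertiary" = 728, "Upper Secondary" = 1173)
sim <- data.frame(
  educ    = factor(rep(names(group_sizes), times = group_sizes)),
  minutes = rgamma(sum(group_sizes), shape = 4, rate = 0.18)
)

# The test replaces the raw values with their ranks in the pooled sample
sim$rank   <- rank(sim$minutes)
mean_ranks <- tapply(sim$rank, sim$educ, mean)
print(round(mean_ranks, 1))

# Kruskal-Wallis rank sum test: are the mean ranks equal across groups?
kw <- kruskal.test(minutes ~ educ, data = sim)
print(kw)
```

Because only the ordering of the values enters the test, extreme observations have much less influence than in ANOVA.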
Since the result of this test is significant, we would also like to run Dunn’s test.
##
## Dunn's test of multiple comparisons using rank sums : holm
##
## mean.rank.diff pval
## Low Secondary-Advanced vocational -313.73514 5.2e-08 ***
## Tertiary-Advanced vocational 76.73315 0.1718
## Upper Secondary-Advanced vocational -18.25290 0.6537
## Tertiary-Low Secondary 390.46829 1.5e-12 ***
## Upper Secondary-Low Secondary 295.48224 1.8e-08 ***
## Upper Secondary-Tertiary -94.98605 0.0385 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
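As can be seen, four of the six pairs differ significantly in their mean ranks: the three pairs that involve the Low Secondary group and, unlike in Tukey’s test, the Upper Secondary vs. Tertiary pair (p = 0.0385).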
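The two post hoc procedures can be compared side by side using the p-values printed in their outputs. The sketch below copies those values; the data frame `cmp` and its column names are introduced only for this illustration.

```r
# Adjusted p-values copied from the Tukey HSD and Dunn outputs above
cmp <- data.frame(
  pair    = c("Low Secondary-Advanced vocational",
              "Tertiary-Advanced vocational",
              "Upper Secondary-Advanced vocational",
              "Tertiary-Low Secondary",
              "Upper Secondary-Low Secondary",
              "Upper Secondary-Tertiary"),
  tukey_p = c(0.0000016, 0.2486925, 0.9870671,
              0.0000000, 0.0000004, 0.0562520),
  dunn_p  = c(5.2e-08, 0.1718, 0.6537, 1.5e-12, 1.8e-08, 0.0385)
)

cmp$tukey_sig <- cmp$tukey_p < 0.05
cmp$dunn_sig  <- cmp$dunn_p  < 0.05

# Pairs on which the two procedures reach different decisions
disagree <- cmp$pair[cmp$tukey_sig != cmp$dunn_sig]
print(cmp)
print(disagree)
```

The only disagreement is the borderline Upper Secondary-Tertiary pair, whose p-values sit on either side of 0.05, so the two approaches tell essentially the same story.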
Now we want to know the substantive significance of group membership in determining the time spent on political news. Therefore, we look at the ‘effect size’: the share of the total variance in time spent watching or reading political news that is explained by the education groups.
## [1] 0.01564355
We get a result of about 0.016, which indicates a small effect size: education group accounts for less than 2% of the variance in time spent on political news. It means that, even though there is a statistically significant difference in time spent on political news across education groups, and this difference is statistically significant for three pairs, in practical terms this effect is not large at all.
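The effect size can be recomputed from the ANOVA table printed earlier. The sketch below is a worked check based on the rounded sums of squares from that table; it suggests that the reported 0.01564355 is consistent with omega-squared, a bias-corrected version of eta-squared, although the original chunk that produced the number is not shown, so this is an inference rather than a confirmed fact.

```r
# Values copied from the ANOVA table above
ss_between <- 6131
ss_within  <- 361166
df_between <- 3
ms_within  <- 127.6

ss_total <- ss_between + ss_within

# Eta-squared: raw share of variance explained by the groups
eta_sq <- ss_between / ss_total

# Omega-squared: the same share with a correction for chance variation
omega_sq <- (ss_between - df_between * ms_within) / (ss_total + ms_within)

print(c(eta_squared = eta_sq, omega_squared = omega_sq))
```

Both measures are below 0.02, so the conclusion about a small practical effect does not depend on which of the two is used.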
Overall, having conducted the tests, we have come to the conclusion that different education groups actually DO spend, on average, different amounts of time on political news. People with tertiary education have the highest mean time spent reading, watching or listening to the news, while people with low secondary education spend the least.
At the same time, in Tukey’s test only the differences between low secondary education and each of the other three groups were significant. The differences for tertiary vs. advanced vocational, upper secondary vs. advanced vocational and upper secondary vs. tertiary were not significant.
Thank you for bearing with us.