We are team APA, and we are studying media and social trust in Belgium. For this purpose we used data from the 2016 round of the European Social Survey (ESS). In this work we present the findings we obtained with the chi-square test, the t-test, and ANOVA; but first, we should look at and describe our variables.
The first variable we took is how much time people spend watching, reading, or listening to news about politics and current affairs (in minutes). In the dataset it is called “nwspol”. This is a continuous variable, so we can build a histogram to see how it is distributed. First, here is the general summary:
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00   30.00   45.00   61.86   60.00 1200.00
```
The standard deviation is 91.45 minutes.
To see the distribution, let’s build a histogram.
Unfortunately, outliers prevent us from seeing the distribution properly, so we removed them; now we can see that the variable is right-skewed and most of the data is concentrated below 100 minutes.
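For reference, a minimal R sketch of these descriptive steps (the 200-minute cutoff used to trim the tail is only illustrative, not the exact value from our script):

```r
summary(ESS$nwspol)           # five-number summary plus the mean
sd(ESS$nwspol, na.rm = TRUE)  # standard deviation, ignoring missing values

# Full histogram: the long right tail hides the bulk of the data
hist(ESS$nwspol, main = "Time spent on news", xlab = "Minutes per day")

# Redraw on trimmed data to see the shape of the distribution
hist(ESS$nwspol[ESS$nwspol < 200],
     main = "Time spent on news (outliers removed)",
     xlab = "Minutes per day")
```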
The next variable is “netusoft”, which describes how often people use the Internet. It has five categories: “Never”, “Only occasionally”, “A few times a week”, “Most days”, and “Every day”. Since this variable is ordinal, the mean is not meaningful; however, we can observe how many cases fall into each category:
As we can see from the bar chart, the overwhelming majority of respondents use the Internet every day, which is an advantage for the further analysis.
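The bar chart comes from a simple frequency table; a sketch (the same approach works for “pplfair” below):

```r
tab <- table(ESS$netusoft)  # counts per category
barplot(tab,
        main = "Frequency of Internet use",
        ylab = "Number of respondents",
        las = 2)            # rotate category labels for readability
```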
The third variable is “netustm”, an interval variable that shows how much time (in minutes) people spend on the Internet, so we can treat it as continuous.
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     0.0    60.0   120.0   185.7   240.0  1320.0     426
```
To calculate the standard deviation we first have to drop the NAs; after that it equals 161.74.
Now let’s look at the distribution:
Here we also see right-skewness, as in the first variable.
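A sketch of this step; `na.rm = TRUE` drops the 426 missing values before computing the standard deviation:

```r
summary(ESS$netustm)           # note the NA count in the output
sd(ESS$netustm, na.rm = TRUE)  # 161.74 once NAs are dropped

hist(ESS$netustm,
     main = "Time spent on the Internet",
     xlab = "Minutes per day")
```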
The next variable is the ordinal “pplfair”, which shows how fairly people expect others to treat them. It is measured on an 11-point scale from 0 = “Most people try to take advantage of me” to 10 = “Most people try to be fair”. So let’s see what it looks like:
From this graph we can conclude that people in Belgium are rather positive about the honesty of those they interact with.
For the chi-square test we decided to look closer at how the frequency of Internet usage is distributed between genders. First, we build a contingency table to see whether there are enough observations overall and in each cell.
| | Never | Only occasionally | A few times a week | Most days | Every day |
|---|---|---|---|---|---|
| Male | 115 | 39 | 48 | 88 | 597 |
| Female | 129 | 37 | 57 | 89 | 567 |
Both the table overall and each cell contain enough observations.
H0: The frequency of Internet usage does not depend on gender
H1: The frequency of Internet usage depends on gender
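Given the variable names in the output below, the test was presumably run along these lines:

```r
table(ESS$gndr, ESS$netusoft)               # the contingency table above
test <- chisq.test(ESS$gndr, ESS$netusoft)  # builds the table internally
test
test$residuals                              # Pearson residuals discussed below
```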
```
## 
##  Pearson's Chi-squared test
## 
## data:  ESS$gndr and ESS$netusoft
## X-squared = 2.37, df = 4, p-value = 0.6681
```
The p-value is greater than 0.05; therefore we cannot reject the null hypothesis. The Pearson residuals show that the observed counts deviate from the expected ones by less than one standard deviation, which also supports the independence of the two variables. Finally, we can say that the frequency of Internet usage does not depend on gender.
Now we are going to see whether there is any difference in time spent on the Internet (variable “netustm”) between people who rated others’ fairness (variable “pplfair”) as low (0 to 3 on the variable’s scale) and those who rated it as high (7 to 10); we also made a “Neutral” group (4 to 6), but it is not used in the t-test. We took these variables because time spent on the Internet partly reflects how involved people are in media, whereas people’s estimates of others’ fairness can serve as a measure of social trust. Here we will use a t-test to see whether there is a statistically significant difference in means between these groups. To apply this test we need to check the assumptions of normality and equality of variances.
First, let’s look again at how “netustm” is distributed. The data is right-skewed and concentrated at a few values, so to get a better picture we applied a logarithmic transformation:
Now it looks closer to a normal distribution than before the transformation. We also need a QQ plot to compare the quantiles of our variable against those of a normal distribution. As can be seen, the quantiles do not line up well, which also speaks against the normality assumption.
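A sketch of the transformation and both diagnostic plots (the `+ 1` guards against log(0), as in the test output below):

```r
log_net <- log(ESSttest$netustm + 1)  # log-transform; +1 avoids log(0)

hist(log_net,
     main = "Log of time spent on the Internet",
     xlab = "log(minutes + 1)")

qqnorm(log_net)  # sample quantiles vs. theoretical normal quantiles
qqline(log_net)  # reference line for a perfect normal fit
```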
Next, we ran formal tests. First, the Kolmogorov-Smirnov test:
H0: the variable is normally distributed
H1: the variable follows some other distribution
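A sketch of a typical call, with the normal parameters estimated from the sample:

```r
x <- log(ESSttest$netustm + 1)
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
```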
```
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  log(ESSttest$netustm + 1)
## D = 3.6606, p-value < 2.2e-16
## alternative hypothesis: two-sided
```
So, the p-value < 2.2e-16; therefore we reject the null hypothesis of normality.
Shapiro-Wilk test:
H0: the variable is normally distributed
H1: the variable follows some other distribution
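The Shapiro-Wilk test is a single call:

```r
shapiro.test(log(ESSttest$netustm + 1))
```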
```
## 
##  Shapiro-Wilk normality test
## 
## data:  log(ESSttest$netustm + 1)
## W = 0.98137, p-value = 5.417e-08
```
So, the p-value = 5.417e-08; therefore we reject the null hypothesis here as well.
Considering that Bartlett’s test is more sensitive to deviations from normality, we decided to use Levene’s test, which is more robust to this assumption.
Levene’s test:
H0: variances are equal
H1: variances are not equal
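The output header below matches `leveneTest()` from the car package; a sketch, assuming `pplfair` is already recoded into the two-level fair/unfair factor used in the t-test:

```r
library(car)
leveneTest(log(ESSttest$netustm + 1) ~ ESSttest$pplfair)
```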
```
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.4039 0.5253
##       725
```
So, the p-value is higher than 0.05; therefore we can accept the equality of variances.
Although we cannot formally accept the normality of our variable, the histogram resembles a normal distribution, the variances are equal, and we have a large number of observations. Moreover, there are no other variables in our data related to our topic, and a more sophisticated non-linear transformation (which might bring the distribution closer to normal than our simple log transformation) is beyond our skills for now. So we will use the t-test, keeping in mind that our variable deviates somewhat from a normal distribution.
H0: there is no difference in the average time people spend on the Internet between those who consider others fair and those who consider others unfair
H1: there is a difference in the average time people spend on the Internet between those who consider others fair and those who consider others unfair
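Since Levene’s test supported equal variances, we use the pooled-variance version (`var.equal = TRUE`, matching the “Two Sample t-test” header below):

```r
t.test(log(ESSttest$netustm + 1) ~ ESSttest$pplfair,
       var.equal = TRUE)  # pooled-variance test; Welch's test is the default otherwise
```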
```
## 
##  Two Sample t-test
## 
## data:  log(ESSttest$netustm + 1) by ESSttest$pplfair
## t = -0.31089, df = 725, p-value = 0.756
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1921260  0.1395966
## sample estimates:
##   mean in group Others are fair mean in group Others are unfair 
##                        4.884671                        4.910936
```
The p-value = 0.756; therefore we cannot reject the null hypothesis. This means there is no statistically significant difference in the average time spent on the Internet between people who consider others unfair and those who consider others fair.
Next, we are going to see whether there is a difference in the average time spent on news (reading, watching, or listening; variable “nwspol”, continuous) between people grouped by their frequency of Internet usage (“netusoft”, a categorical ordinal variable). To apply the relevant test, let’s check the assumptions.
First, let’s see how the data is distributed.
As we can see, the distribution in each group is not normal and there are also a lot of outliers, so let’s remove them and apply a log transformation.
Now the distributions look more normal, but this is not enough, so here is a table of the Shapiro-Wilk test p-values for each group (see the sketch after the table for how it can be computed):
| Group | P-value |
|---|---|
| Never | 3.19e-16 |
| Only occasionally | 2.67e-05 |
| A few times a week | 6.60e-08 |
| Most days | 9.72e-11 |
| Every day | 5.01e-28 |
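A sketch of how such a table can be computed, using `tapply()` to run the test within each group:

```r
# Shapiro-Wilk p-value within each Internet-usage group
tapply(log(ESSanov$nwspol + 1), ESSanov$netusoft,
       function(x) shapiro.test(x)$p.value)
```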
The formal test shows a significant deviation from normality in every group, so after applying a suitable ANOVA we will also run a non-parametric test to check whether the ANOVA result is distorted.
Now let’s check the homogeneity of variances.
H0: variances are equal
H1: variances are unequal
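The same `leveneTest()` call as before, now with the news-time variable and the five usage groups:

```r
leveneTest(log(ESSanov$nwspol + 1) ~ ESSanov$netusoft)
```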
```
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  0.4189 0.7951
##       1761
```
Since Levene’s test gives a large p-value, we conclude that the variances are equal and will run the ANOVA under that assumption.
Now it’s high time to run the ANOVA.
H0: the average time spent on news is the same across the Internet-usage groups
H1: the group means are different (for at least one pair)
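The “One-way analysis of means” header in the output below corresponds to `oneway.test()` with `var.equal = TRUE`:

```r
oneway.test(log(ESSanov$nwspol + 1) ~ ESSanov$netusoft,
            var.equal = TRUE)  # equal variances, as justified by Levene's test
```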
```
## 
##  One-way analysis of means
## 
## data:  log(ESSanov$nwspol + 1) and ESSanov$netusoft
## F = 3.8793, num df = 4, denom df = 1742, p-value = 0.003841
```
So, we reject H0 (F(4, 1742) = 3.8793, p-value = 0.003841), which means that at least one pair of groups differs significantly in mean. But we still remember that our data is not normally distributed, so we should check whether we can rely on the ANOVA result by running the Kruskal-Wallis test:
H0: the groups have the same distribution of time spent on news
H1: at least one group’s distribution is shifted relative to the others
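The Kruskal-Wallis test is rank-based, so the (monotone) log transformation does not affect its result; the call:

```r
kruskal.test(log(ESSanov$nwspol + 1) ~ ESSanov$netusoft)
```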
```
## 
##  Kruskal-Wallis rank sum test
## 
## data:  log(ESSanov$nwspol + 1) by ESSanov$netusoft
## Kruskal-Wallis chi-squared = 30.92, df = 4, p-value = 3.179e-06
```
The p-value again shows a significant difference between the groups. Therefore we can run Tukey’s honestly significant differences test to see which groups differ in mean.
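Tukey’s HSD works on a fitted `aov` model; a sketch:

```r
fit <- aov(log(ESSanov$nwspol + 1) ~ ESSanov$netusoft)
TukeyHSD(fit)  # pairwise differences with adjusted p-values, tabled below
```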
| | diff | lwr | upr | p adj |
|---|---|---|---|---|
| Only occasionally-Never | -0.0845346 | -0.4514539 | 0.2823847 | 0.9703942 |
| A few times a week-Never | -0.2448618 | -0.5693962 | 0.0796727 | 0.2380104 |
| Most days-Never | -0.3173913 | -0.5940204 | -0.0407622 | 0.0151344 |
| Every day-Never | -0.2538570 | -0.4506340 | -0.0570800 | 0.0040023 |
| A few times a week-Only occasionally | -0.1603271 | -0.5796635 | 0.2590092 | 0.8347908 |
| Most days-Only occasionally | -0.2328567 | -0.6163199 | 0.1506065 | 0.4603709 |
| Every day-Only occasionally | -0.1693223 | -0.4998394 | 0.1611948 | 0.6284590 |
| Most days-A few times a week | -0.0725295 | -0.4156576 | 0.2705985 | 0.9784159 |
| Every day-A few times a week | -0.0089952 | -0.2917219 | 0.2737315 | 0.9999871 |
| Every day-Most days | 0.0635343 | -0.1625972 | 0.2896658 | 0.9399700 |
So, we can easily identify only two pairs of groups that are significantly different: people who use the Internet most days versus those who never use it, and people who use it every day versus those who never use it. It will be interesting to see which of these groups spends more time on reading or listening to news. For that we build comparative boxplots:
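A sketch of the comparative boxplots on the log scale:

```r
boxplot(log(ESSanov$nwspol + 1) ~ ESSanov$netusoft,
        xlab = "Frequency of Internet use",
        ylab = "log(minutes of news + 1)",
        main = "Time spent on news by Internet-usage group")
```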
Surprisingly, we found that people who never use the Internet devote more time to news than those who use it most days or every day. So perhaps we can conclude that Internet usage is not a crucial factor in people’s attachment to the media (especially to news).
However, it is clear that our explanatory factor does not explain much of the variance (only two pairs differ significantly), but we can still estimate the effect size with omega squared.
```r
omega_sq(aov.out = aov(log(ESSanov$nwspol + 1) ~ ESSanov$netusoft))
## [1] 0.006549277
```
So, as we expected, the effect size is small.
In conclusion, we can sum up our results:
1. The frequency of Internet usage does not depend on gender (tested with the chi-square test).
2. There is no difference in the average time spent on the Internet between people who consider others fair and those who consider others unfair (tested with the t-test).
3. People who never use the Internet spend more time on news than those who use it most days or every day (tested with one-way ANOVA).