We are APA, and we are studying media and social trust in Belgium. For this purpose we used data from the 2016 European Social Survey. In this report we present the findings we obtained with the chi-square test, the t-test, and ANOVA. First, however, we describe our variables.

Exploring variables

Variable 1

The first variable is the time people spend watching, reading, or listening to news about politics and current affairs (in minutes). In the dataset it is called “nwspol”. It is a continuous variable, so we can build a histogram to see how it is distributed. First, here are its summary statistics:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   30.00   45.00   61.86   60.00 1200.00

The standard deviation is 91.45 minutes.

To see the distribution let’s build a histogram:

Unfortunately, outliers prevent us from seeing the distribution properly, so we removed them. Without them, we see that the variable is right-skewed and that most of the data is concentrated below 100 minutes.

Variable 2

The next variable is “netusoft”, which describes how often people use the Internet. It has 5 categories: “Never”, “Only occasionally”, “A few times a week”, “Most days”, and “Every day”. Since this variable is ordinal, the mean and standard deviation are not meaningful for it (only order-based summaries such as the median category are). Instead, we can look at how many cases fall into each category:

As the bar chart shows, the overwhelming majority of respondents use the Internet every day, which is an advantage for the further analysis.

Variable 3

The third variable is “netustm”, a numeric variable that shows how much time (in minutes) people spend on the Internet, so we can summarize it numerically.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    60.0   120.0   185.7   240.0  1320.0     426

To calculate the standard deviation we first have to remove the NAs; after that it equals 161.74 minutes.
Now let’s look at the distribution:

Here we also see right skewness, as in the first variable.
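The handling of missing values described above can be sketched in R as follows. This is only an illustration: the vector `netustm` below is made-up data standing in for the ESS variable, not the survey values.

```r
# Hypothetical stand-in for ESS$netustm (minutes online per day, with NAs).
netustm <- c(0, 60, 120, 240, NA, 30, 1320, NA, 180, 90)

# summary() reports the number of NAs next to the quartiles and the mean.
print(summary(netustm))

# sd() returns NA if any value is missing, so the NAs are dropped explicitly.
sd_minutes <- sd(netustm, na.rm = TRUE)

# A log transform compresses the long right tail; +1 keeps zero minutes defined.
log_minutes <- log(netustm + 1)
```

Missing values stay missing after the transformation, so later tests still have to drop them.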

Variable 4

The next variable, “pplfair”, is ordinal and shows how fair people believe others are in their dealings with them. It has the following scale: 0 = Most people try to take advantage of me, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 = Most people try to be fair. Let’s see what it looks like:

From this graph we can conclude that people in Belgium tend to regard those they interact with as fairly honest.

Chi-square

For the chi-square test we decided to look more closely at how the frequency of Internet usage is distributed across genders. First, we build a table to check that there are enough observations overall and in each cell.

Number of observations

         Never   Only occasionally   A few times a week   Most days   Every day
Male       115                  39                   48          88         597
Female     129                  37                   57          89         567

The table and each cell have enough observations.
H0: The frequency of Internet usage does not depend on gender
H1: The frequency of Internet usage depends on gender

## 
##  Pearson's Chi-squared test
## 
## data:  ESS$gndr and ESS$netusoft
## X-squared = 2.37, df = 4, p-value = 0.6681

The p-value is greater than 0.05, so we cannot reject the null hypothesis. The Pearson residuals show that no observed count deviates from its expected value by even one standard deviation, which is also consistent with independence. We therefore find no evidence that the frequency of Internet usage depends on gender.
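The test above can be reproduced directly from the table of counts. The sketch below is an assumption about how such a check could be written, not the original script (which used `ESS$gndr` and `ESS$netusoft`); the object name `counts` is ours.

```r
# Observed counts copied from the table of gender by Internet usage frequency.
counts <- matrix(
  c(115, 39, 48, 88, 597,
    129, 37, 57, 89, 567),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Male", "Female"),
    usage  = c("Never", "Only occasionally", "A few times a week",
               "Most days", "Every day")
  )
)

res <- chisq.test(counts)

print(res)                                  # statistic, df and p-value
min_expected  <- min(res$expected)          # rule of thumb: at least 5 per cell
max_abs_resid <- max(abs(res$residuals))    # largest Pearson residual
```

Here `res$expected` supports the "enough observations in each cell" check, and `res$residuals` holds the Pearson residuals discussed above.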

T-test

Now we are going to see whether time spent on the Internet (variable netustm) differs between people who rated others’ fairness (variable pplfair) as low (0 to 3 on the variable’s scale) and those who rated it as high (7 to 10). We also created a “Neutral” group (4 to 6), but it is not used in the t-test. We chose these variables because time spent on the Internet partly reflects how involved people are in media, whereas people’s estimates of others’ fairness can serve as a measure of social trust. We use a t-test to see whether there is a statistically significant difference in means between the two groups. To apply this test we need to check the assumptions of normality and equality of variances.

Checking normality

First, let’s look again at how the variable “netustm” is distributed. Our data is right-skewed and most of it is concentrated at a few values, so to get a better picture we applied a logarithmic transformation:

Now it looks closer to a normal distribution than before the transformation. We also need to look at a Q-Q plot to compare the quantiles of our variable with those of a normal distribution. As can be seen, the quantiles do not line up well, which speaks against the assumption of normality.

Next, we ran formal tests, starting with the Kolmogorov-Smirnov test:
H0: the variable is normally distributed
H1: the variable is not normally distributed

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  log(ESSttest$netustm + 1)
## D = 3.6606, p-value < 2.2e-16
## alternative hypothesis: two-sided

The p-value is below 2.2e-16, so we reject the null hypothesis of normality.

Shapiro-Wilk test:
H0: the variable is normally distributed
H1: the variable is not normally distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  log(ESSttest$netustm + 1)
## W = 0.98137, p-value = 5.417e-08

The p-value is 5.417e-08, so we again reject the null hypothesis of normality.
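Both normality checks can be sketched in base R. The data here are simulated and purely hypothetical, and passing the sample mean and standard deviation to `ks.test` is our assumption about a reasonable setup, not necessarily the call used for the output above. With estimated parameters the Kolmogorov-Smirnov p-value is only approximate.

```r
set.seed(1)

# Hypothetical right-skewed "minutes online" data, then the same log transform.
minutes <- rexp(500, rate = 1 / 180)
y <- log(minutes + 1)

# Kolmogorov-Smirnov test against a normal distribution with the sample's
# own mean and standard deviation (p-value is approximate in this case).
ks <- ks.test(y, "pnorm", mean = mean(y), sd = sd(y))

# Shapiro-Wilk test; it needs between 3 and 5000 observations.
sw <- shapiro.test(y)

print(ks)
print(sw)
```

In both tests a small p-value is evidence against normality; a large one does not prove that the data are normal.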

Checking variance equality

Since Bartlett’s test is more sensitive to departures from normality, we decided to use Levene’s test, which is less sensitive to this assumption.

Levene’s test:
H0: variances are equal
H1: variances are not equal

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.4039 0.5253
##       725

The p-value is higher than 0.05, so we do not reject the hypothesis of equal variances.
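For readers without an add-on package, the median-centered Levene’s test can be sketched in base R as a one-way ANOVA on absolute deviations from the group medians. The helper `levene_median` and the simulated data are hypothetical illustrations, not the code behind the output above.

```r
# Levene's test with center = median: ANOVA on |y - group median|.
levene_median <- function(y, group) {
  group <- factor(group)
  dev <- abs(y - ave(y, group, FUN = median))  # absolute deviations
  anova(lm(dev ~ group))                       # F test across the groups
}

set.seed(42)
# Hypothetical log-minutes for two groups with the same spread.
y <- c(rnorm(40, mean = 5, sd = 1), rnorm(40, mean = 5, sd = 1))
g <- rep(c("Others are fair", "Others are unfair"), each = 40)

out <- levene_median(y, g)
print(out)
```

The first row of `out` holds the F value and p-value for the group effect, matching the layout of the Levene output shown above.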

T-test

Although we cannot formally accept normality of our variable, the histogram resembles a normal distribution, the variances are equal, and we have a sufficient number of observations. Moreover, there are no other variables in our data that are related to our topic, and we have not yet explored other non-linear transformations, which might bring the distribution closer to normal than the simple log transformation we used. So we will use a t-test, keeping in mind that our variable deviates somewhat from a normal distribution.

H0: there is no difference in average time (in minutes) people spend on the Internet between those who estimate others as fair and those who estimate others as unfair
H1: there is a difference in average time (in minutes) people spend on the Internet between those who estimate others as fair and those who estimate others as unfair

## 
##  Two Sample t-test
## 
## data:  log(ESSttest$netustm + 1) by ESSttest$pplfair
## t = -0.31089, df = 725, p-value = 0.756
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1921260  0.1395966
## sample estimates:
##   mean in group Others are fair mean in group Others are unfair 
##                        4.884671                        4.910936

The p-value is 0.756, so we cannot reject the null hypothesis: there is no statistically significant difference in average time spent on the Internet between people who rate others as unfair and those who rate others as fair. Note that the group means are on the log scale; back-transformed with exp(mean) - 1, they correspond to roughly 131 and 135 minutes.
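A pooled-variance t-test like the one above can be sketched as follows. The data frame `dat` is simulated and hypothetical; only the call pattern (`var.equal = TRUE`, justified by Levene’s test) mirrors the analysis.

```r
set.seed(7)

# Hypothetical log-transformed minutes for the two fairness groups.
dat <- data.frame(
  log_minutes = c(rnorm(50, mean = 4.9, sd = 1.1),
                  rnorm(60, mean = 4.9, sd = 1.1)),
  fairness = factor(rep(c("Others are fair", "Others are unfair"),
                        times = c(50, 60)))
)

# var.equal = TRUE requests the classic Student test instead of Welch's test.
tt <- t.test(log_minutes ~ fairness, data = dat, var.equal = TRUE)
print(tt)

# Back-transform the log-scale group means to minutes.
back_transformed <- exp(tt$estimate) - 1
```

With equal variances assumed, the degrees of freedom equal n1 + n2 - 2, which is why the output above reports df = 725.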

ANOVA

We are going to see whether average time spent on news (reading, watching, or listening; variable nwspol, continuous) differs between people grouped by their frequency of Internet usage (netusoft, an ordinal categorical variable). To apply the relevant test, let’s check the assumptions.

Checking normality

First, let’s see how our data is distributed:

As we can see, the distribution in each group is not normal and there are many outliers, so let’s remove them and apply a log transformation.

Now the distributions look closer to normal, but a visual check is not enough, so here are the Shapiro-Wilk p-values for each group:

Table of groups (Internet usage frequency) and p-values of the Shapiro-Wilk test

Group                 P-value
Never                 3.19e-16
Only occasionally     2.67e-05
A few times a week    6.60e-08
Most days             9.72e-11
Every day             5.01e-28

The formal test shows a significant deviation from normality in every group. We will therefore also run a non-parametric test after ANOVA to check whether the ANOVA result is distorted.

Variances

Now let’s check homogeneity of variances
H0: variances are equal
H1: variances are unequal

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  0.4189 0.7951
##       1761

Levene’s test gives a large p-value, so we do not reject the hypothesis of equal variances and will run ANOVA assuming that the variances are equal.

ANOVA

Now it is time to run ANOVA.
H0: average time spent on news is the same across the Internet usage groups
H1: at least one pair of group means differs

## 
##  One-way analysis of means
## 
## data:  log(ESSanov$nwspol + 1) and ESSanov$netusoft
## F = 3.8793, num df = 4, denom df = 1742, p-value = 0.003841

So we reject H0 (F(4, 1742) = 3.8793, p-value = 0.003841), which means that at least one pair of groups differs significantly in mean. However, our data is not normally distributed, so we check whether we can rely on the ANOVA result by running the Kruskal-Wallis test:
H0: the distribution of time spent on news is the same in all groups
H1: at least one group differs from the others

## 
##  Kruskal-Wallis rank sum test
## 
## data:  log(ESSanov$nwspol + 1) by ESSanov$netusoft
## Kruskal-Wallis chi-squared = 30.92, df = 4, p-value = 3.179e-06

The p-value again shows a significant difference between groups. We can therefore run Tukey’s honestly significant difference test to see which groups differ in mean.

Tukey’s HSD results

Comparison                                   diff          lwr          upr       p adj
Only occasionally-Never                -0.0845346   -0.4514539    0.2823847   0.9703942
A few times a week-Never               -0.2448618   -0.5693962    0.0796727   0.2380104
Most days-Never                        -0.3173913   -0.5940204   -0.0407622   0.0151344
Every day-Never                        -0.2538570   -0.4506340   -0.0570800   0.0040023
A few times a week-Only occasionally   -0.1603271   -0.5796635    0.2590092   0.8347908
Most days-Only occasionally            -0.2328567   -0.6163199    0.1506065   0.4603709
Every day-Only occasionally            -0.1693223   -0.4998394    0.1611948   0.6284590
Most days-A few times a week           -0.0725295   -0.4156576    0.2705985   0.9784159
Every day-A few times a week           -0.0089952   -0.2917219    0.2737315   0.9999871
Every day-Most days                     0.0635343   -0.1625972    0.2896658   0.9399700
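The sequence of tests in this section (equal-variance one-way ANOVA, Kruskal-Wallis as a robustness check, then Tukey’s HSD) can be sketched in base R. The data frame `dat` is simulated and hypothetical; the group means in it are invented for illustration only.

```r
set.seed(123)

usage_levels <- c("Never", "Only occasionally", "A few times a week",
                  "Most days", "Every day")

# Hypothetical log-transformed news minutes, 30 respondents per group.
dat <- data.frame(
  usage = factor(rep(usage_levels, each = 30), levels = usage_levels),
  log_news = rnorm(150,
                   mean = rep(c(4.0, 3.9, 3.8, 3.7, 3.7), each = 30),
                   sd = 0.8)
)

# One-way ANOVA assuming equal variances (as supported by Levene's test).
anova_eq <- oneway.test(log_news ~ usage, data = dat, var.equal = TRUE)

# Rank-based check that does not rely on normality.
kw <- kruskal.test(log_news ~ usage, data = dat)

# Pairwise comparisons with family-wise error control.
tukey <- TukeyHSD(aov(log_news ~ usage, data = dat))
print(tukey$usage)
```

Each row of `tukey$usage` is named "later level-earlier level", so a negative `diff` in a row ending in "-Never" means the "Never" group has the higher mean.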

We can easily see that only two pairs of groups differ significantly: people who use the Internet most days or every day versus those who never use it. It is interesting to see which of these groups spends more time reading or listening to news, so we build comparative boxplots:

Surprisingly, we found that people who never use the Internet devote more time to news than those who use it most days or every day. So perhaps Internet usage is not a crucial factor in people’s attachment to media (especially to news).
However, our explanatory factor clearly explains little of the variance (only two pairs differ significantly), so we estimate the effect size with omega squared.

omega_sq(aov.out = aov((log(ESSanov$nwspol + 1) ~ ESSanov$netusoft)))
## [1] 0.006549277

As we expected, the effect size is small: the grouping explains less than 1% of the variance.
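As a cross-check, omega squared for a one-way design can be recovered from the reported F statistic alone. The sketch assumes the usual definition, omega squared = (SS_between - df_between * MS_within) / (SS_total + MS_within), which simplifies to the expression in the hypothetical helper `omega_sq_from_f` below; that `omega_sq` uses the same definition is our assumption.

```r
# Omega squared from F and its degrees of freedom (one-way ANOVA only).
omega_sq_from_f <- function(f, df_between, df_within) {
  n_total <- df_between + df_within + 1        # total number of observations
  df_between * (f - 1) / (df_between * (f - 1) + n_total)
}

# Values taken from the ANOVA output above: F(4, 1742) = 3.8793.
omega <- omega_sq_from_f(f = 3.8793, df_between = 4, df_within = 1742)
print(round(omega, 4))
```

The result agrees with the reported value of about 0.0065 up to rounding of F.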

Conclusion

In conclusion, we can sum up our results:

1. We found no evidence that the frequency of Internet usage depends on gender (we used the chi-square test for this).

2. We found no significant difference in the duration of Internet usage between people who rate others as fair and those who rate others as unfair (we used the t-test for this).

3. We found that people who never use the Internet spend more time on news than those who use it most days or every day (we used one-way ANOVA for this).