Project 2

Data

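library(foreign)   # read.spss()
library(dplyr)     # %>%, select(), filter()
library(ggplot2)   # plots
library(scales)    # percent()

ESS8RU_spss <- read.spss("ESS8RU.sav", use.value.labels = TRUE, to.data.frame = TRUE)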

rus1 <- ESS8RU_spss %>% dplyr::select(nwspol, netusoft, netustm, ppltrst, pplfair, pplhlp, gndr,agea,atchctr,eduyrs) 
rus1 <- rus1 %>% filter(!is.na(netusoft))

rus1$gndr <- as.factor(rus1$gndr)

Chi-Square

Data and its visualization

How often do respondents use the internet?

ggplot(rus1, aes(x = netusoft, y = ..count../sum(..count..))) + 
  geom_bar(alpha = 0.5, fill = "lightblue", color = "black") +
  geom_text(aes(label = percent(..count../sum(..count..))), size = 3.5, stat= "count", position = position_stack(vjust = 0.5)) +
  labs(title = "Distribution of Internet use frequency by gender", x = "Frequency of Internet use", y = "Percentage of respondents") +
  scale_y_continuous(labels = percent) +
  facet_grid(~gndr)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

Next, we build a contingency table for the chi-square test:

inet_gndr
##                     
##                      Male Female
##   Never               255    468
##   Only occasionally    81    114
##   A few times a week  124    148
##   Most days           100    112
##   Every day           475    546
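The call that produced inet_gndr is not shown. A minimal sketch of one way to build such a table, assuming base R's table() was used on the two factors (an assumption; the original call may differ). The toy data frame below is hypothetical and only stands in for rus1:

```r
# Hypothetical stand-in for rus1 (not the ESS sample)
toy <- data.frame(
  netusoft = factor(
    c("Never", "Every day", "Every day", "Never", "Most days", "Every day"),
    levels = c("Never", "Most days", "Every day")
  ),
  gndr = factor(
    c("Male", "Female", "Female", "Female", "Male", "Male"),
    levels = c("Male", "Female")
  )
)

# Rows = frequency of Internet use, columns = gender
inet_gndr_toy <- table(toy$netusoft, toy$gndr)
inet_gndr_toy

# On the real data the analogous call would be (assumed):
# inet_gndr <- table(rus1$netusoft, rus1$gndr)
```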

Hypotheses

Our hypotheses are the following:

H0: Gender and frequency of Internet use are independent.

H1: Gender and frequency of Internet use are dependent.

Assumptions

In order to perform a chi-square test for independence, we should check several assumptions:

  1. Each observation is independent of all the others. Our observations are independent of one another.

  2. All expected counts should be 10 or greater. In order to check this assumption, we have to see expected values.

chisq <- chisq.test(inet_gndr)

round(chisq$observed,2)
##                     
##                      Male Female
##   Never               255    468
##   Only occasionally    81    114
##   A few times a week  124    148
##   Most days           100    112
##   Every day           475    546

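Note that the output above lists the observed counts, while the assumption concerns the expected counts, which are stored in chisq$expected. The smallest expected count (males, “Only occasionally”) is 195 × 1035 / 2423 ≈ 83.3, so this assumption is also met.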
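The expected counts can be inspected directly. The sketch below re-enters the observed counts from the inet_gndr table printed above (the object name inet_gndr_obs is introduced here for illustration only) and pulls the expected counts from chisq.test():

```r
# Observed counts copied from the inet_gndr table printed above
# (matrix() fills column by column: first Male, then Female)
inet_gndr_obs <- matrix(
  c(255, 81, 124, 100, 475,
    468, 114, 148, 112, 546),
  ncol = 2,
  dimnames = list(
    c("Never", "Only occasionally", "A few times a week", "Most days", "Every day"),
    c("Male", "Female")
  )
)

# Expected counts under independence: row total * column total / grand total
expected_counts <- chisq.test(inet_gndr_obs)$expected
round(expected_counts, 1)

# The assumption is about the smallest of these
min_expected <- min(expected_counts)
min_expected
```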

Conducting chi-square test for independence

chisq
## 
##  Pearson's Chi-squared test
## 
## data:  inet_gndr
## X-squared = 25.177, df = 4, p-value = 4.636e-05

The p-value (4.636e-05) is far below 0.05, so the chi-square test is statistically significant.

Investigating residuals

Since the results are significant, let's have a look at the residuals:

chisq$stdres
##                     
##                            Male     Female
##   Never              -4.8320229  4.8320229
##   Only occasionally  -0.3465516  0.3465516
##   A few times a week  1.0164912 -1.0164912
##   Most days           1.3724784 -1.3724784
##   Every day           3.2331788 -3.2331788

Looking at the standardized residuals, the cells that stand out are in the “Never” and “Every day” rows.

The number of men who use the Internet every day was higher than expected (standardized residual 3.23), while the number of women in the same category was lower than expected (-3.23). Among those who never use the Internet, fewer men were observed than expected (-4.83) and more women than expected (4.83). The residuals for the three middle categories are all below 2 in absolute value.
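A quick way to see which rows stand out is to flag standardized residuals beyond a cut-off. The sketch below uses |residual| > 1.96, a common rule of thumb rather than anything prescribed by this report, and re-enters the counts from the inet_gndr table (the names inet_gndr_obs, std_res and flagged_rows are illustrative):

```r
# Observed counts copied from the inet_gndr table in this report
inet_gndr_obs <- matrix(
  c(255, 81, 124, 100, 475,
    468, 114, 148, 112, 546),
  ncol = 2,
  dimnames = list(
    c("Never", "Only occasionally", "A few times a week", "Most days", "Every day"),
    c("Male", "Female")
  )
)

# Standardized residuals, as returned by chisq.test()
std_res <- chisq.test(inet_gndr_obs)$stdres

# In a table with two columns the residuals mirror each other,
# so checking one column is enough
flagged_rows <- rownames(std_res)[abs(std_res[, "Male"]) > 1.96]
flagged_rows
```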

corrplot::corrplot(chisq$stdres, addCoef.col = TRUE, is.cor = FALSE)

This plot confirms the pattern in the residual table. Overall, the residuals are large for the categories “Never” and “Every day”.

Conclusion

Since the p-value is very small, we can conclude that the data gives strong evidence against the Null Hypothesis. To be more precise, the data shows that there is an association between Gender and frequency of Internet use.

T-Test

Preparing Data

ESS8RU_spss1 <- read.spss("ESS8RU.sav", use.value.labels = T, to.data.frame = T)

We decided to add some new variables to enrich the quality of insights we may get from the information:

atchctr - emotional attachment to the country (0-10).

eduyrs - number of years of formal education.

rus2 <- ESS8RU_spss1 %>% dplyr::select(nwspol, netustm, ppltrst, pplfair, pplhlp, gndr,agea,atchctr,eduyrs)

T-test 1

Some people claim that there are rather few women in governmental structures. One suggested reason is that women are less inclined than men to engage with politics and current affairs. If so, men should spend more time reading or listening to the news. Let's check whether this assumption holds.

Hypotheses

H0: On average, men and women spend an equal amount of time reading the news on politics and current affairs. [the true difference in means is 0]

H1: On average, men and women differ in the amount of time spent on the news. [the true difference in means is not 0]

Data visualization

rus2$nwspol <- as.numeric(rus2$nwspol)
ggplot(data = rus2, aes(x = gndr, y = nwspol))+
  geom_boxplot(na.rm = T, color = "black", fill = "lightblue")+
  theme_bw()+
  stat_summary(fun.y = mean, geom = "point", col = "red", shape = 4, na.rm = T)+
  labs(title = "Time spent on political news and current affairs according to gender", x = "Gender", y = "Time, min")+
  scale_y_continuous(breaks = 0:15*5)

Here we can see that the median time spent on political news and current affairs is the same for males and females.

It seems the initial assumption does not hold! Let us check with a t-test whether the difference is indeed insignificant.

Checking normality

1.1 Checking normality via histogram:

ggplot(rus2, aes(x = nwspol))+
  geom_histogram(aes(y = ..density..),binwidth = 4,color = "black", fill = "white", na.rm = T)+
  geom_density(alpha = 0.3, fill = "lightblue", na.rm = T)+
  labs(title = "Time spent on news", x = "Time, min", y = "Density")+
  scale_x_continuous(breaks = 0:20*5)+
  theme_bw() +
  facet_grid(rows = vars(gndr))

1.2 Checking normality via qqplot:

par(mfrow = c(1,2))
qqnorm(rus2$nwspol[rus2$gndr == "Female"], main = "Females"); qqline(rus2$nwspol[rus2$gndr == "Female"], col = 2)
qqnorm(rus2$nwspol[rus2$gndr == "Male"], main = "Males"); qqline(rus2$nwspol[rus2$gndr == "Male"], col = 2)

After checking normality using two different methods, we can conclude that the distribution of time spent on political news and current affairs is normal for both genders.

Conducting t-test

Time for applying t-test!

t.test(rus2$nwspol ~ rus2$gndr)
## 
##  Welch Two Sample t-test
## 
## data:  rus2$nwspol by rus2$gndr
## t = 2.7309, df = 2062.4, p-value = 0.006369
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.325059 1.981256
## sample estimates:
##   mean in group Male mean in group Female 
##             13.96393             12.81077

Conclusion

On average, men spend about 14 min reading, watching, or listening to news about politics and current affairs, while women spend about 12.8 min. The test statistic is t(2062.4) = 2.73 (p-value = 0.006369). Because the p-value is below 0.05, the observed difference in means is statistically significant and we reject H0: men and women differ in the average time they spend on such news, with men spending about 1.2 min more (95% CI: 0.33 to 1.98).
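For readers who want to see where the t value and the fractional degrees of freedom in a Welch test come from, here is a minimal sketch on simulated data. The vectors x and y are hypothetical and are not the ESS sample. The statistic is the difference in means divided by the combined standard error, and the degrees of freedom follow the Welch-Satterthwaite approximation:

```r
set.seed(1)
x <- rnorm(40, mean = 14, sd = 9)  # hypothetical group 1
y <- rnorm(55, mean = 13, sd = 7)  # hypothetical group 2

# Squared standard errors of the two group means
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite degrees of freedom (usually not an integer)
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Compare with the built-in test (var.equal = FALSE is the default)
welch <- t.test(x, y)
c(t_manual = t_manual, t_builtin = unname(welch$statistic))
c(df_manual = df_manual, df_builtin = unname(welch$parameter))
```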

Conducting Mann-Whitney-Wilcoxon test

Now, let us double-check our results with another test. As we have two independent samples (males and females), we should use the Mann-Whitney-Wilcoxon test.

wilcox.test(rus2$nwspol ~ rus2$gndr)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  rus2$nwspol by rus2$gndr
## W = 708100, p-value = 0.01051
## alternative hypothesis: true location shift is not equal to 0

Here the Wilcoxon rank sum W = 708100, p-value = 0.01051, which means that we can say that men and women are different in the amount of time they spend on the news. These results confirm the t-test conclusion.
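The W statistic that wilcox.test() reports can be reproduced by hand: rank all observations together, sum the ranks of the first group, and subtract the smallest possible rank sum for that group. A minimal sketch on simulated data (x and y are hypothetical, not the ESS sample):

```r
set.seed(2)
x <- rnorm(30, mean = 14, sd = 9)  # hypothetical group 1
y <- rnorm(45, mean = 12, sd = 9)  # hypothetical group 2

# Rank the pooled sample; the first length(x) ranks belong to x
ranks <- rank(c(x, y))
n_x <- length(x)

# W = rank sum of group 1 minus its minimum possible value n_x * (n_x + 1) / 2
w_manual <- sum(ranks[seq_len(n_x)]) - n_x * (n_x + 1) / 2

w_builtin <- unname(wilcox.test(x, y)$statistic)
c(w_manual = w_manual, w_builtin = w_builtin)
```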

T-test 2

Given that there is a statistically significant gender difference in time spent on political news, it seems logical to look for other factors that might explain the variation in time spent on political news and current affairs.

It seems reasonable to expect that people who feel more emotionally attached to the country spend more time reading or listening to political news, as they are likely to be more concerned with the current political situation.

Hypotheses

H0: Those people who are attached to the country and those who are not attached to the country spend an equal amount of time on the news about politics and current affairs.

H1: Those people who are attached to the country and those who are not attached to the country spend different amounts of time on the news about politics and current affairs.

Data visualization

#Getting rid of 'character' values in the dataset:

rus2$atchctr = ifelse(rus2$atchctr == "Not at all emotionally attached","0",rus2$atchctr)
rus2$atchctr = ifelse(rus2$atchctr == "Very emotionally attached","10",rus2$atchctr)

#Turning 'atchctr' into factor with 2 levels:

rus2$atchctr = ifelse(rus2$atchctr == 0|rus2$atchctr == 1|rus2$atchctr == 2|rus2$atchctr == 3|rus2$atchctr == 4, "Not attached", "Attached")

rus2$atchctr <- as.factor(rus2$atchctr)
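One thing to watch in recodes of this kind: if atchctr arrives from read.spss() as a factor (this depends on the file's value labels and is an assumption here), ifelse() and as.numeric() work with the integer level codes rather than the printed 0-10 labels. A minimal sketch on a toy factor (hypothetical data and object names, not the ESS file) of converting through as.character() first:

```r
# Toy factor mimicking a 0-10 scale whose endpoints carry text labels
# (hypothetical data; the level names are assumed from the recode above)
lv <- c("Not at all emotionally attached", as.character(1:9),
        "Very emotionally attached")
att <- factor(c("Not at all emotionally attached", "1", "4", "5",
                "Very emotionally attached", NA), levels = lv)

# Pitfall: the integer codes run 1..11, so the answer "4" has code 5
codes <- as.integer(att)

# Safer: go through character, map the two labelled endpoints, then to numeric
att_chr <- as.character(att)
att_chr[att_chr %in% "Not at all emotionally attached"] <- "0"
att_chr[att_chr %in% "Very emotionally attached"] <- "10"
att_num <- as.numeric(att_chr)

# Two-level grouping: 0-4 "Not attached", 5-10 "Attached"; NA stays NA
att_group <- factor(ifelse(att_num <= 4, "Not attached", "Attached"))
table(att_group, useNA = "ifany")
```

Whether the results in this report are affected depends on how the variables were actually stored after import, which the output shown here does not reveal.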

rus2 <- rus2 %>% filter(!is.na(atchctr))
ggplot(data = rus2, aes(x = atchctr, y = nwspol)) +
  geom_boxplot(na.rm = T, color = "black", fill = "lightblue") +
  theme_bw() +
  stat_summary(fun.y = mean, na.rm = T, geom = "errorbar", aes(ymax = ..y.., ymin = ..y..), width = .75, linetype = "dashed") +
  labs(title = "Time spent on political news and current affairs\nconsidering emotional attachment to the native country", x = "Emotional attachment", y = "Time, min") +
  scale_y_continuous(breaks = 0:15*5)

As can be seen from the boxplot, there is indeed a difference in time spent on political news between people with different degrees of affiliation to the country: those who feel attached to the country spend more time on news than those who do not.

We have already seen the distribution of data of the continuous variable we are working with, so it is time for t-test.

Conducting t-test

t.test(rus2$nwspol ~ rus2$atchctr)
## 
##  Welch Two Sample t-test
## 
## data:  rus2$nwspol by rus2$atchctr
## t = 5.1632, df = 279.41, p-value = 4.619e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.022854 4.515735
## sample estimates:
##     mean in group Attached mean in group Not attached 
##                   13.61031                   10.34101

Conclusion

On average, people who feel emotionally attached to their native country (Russia, in our case) spend about 14 min on news about politics and current affairs, whereas those who do not spend about 10 min. The test statistic is t(279.41) = 5.2 (p-value = 4.619e-07, far below 0.05). Thus, we can conclude that the difference is statistically significant and that people with a greater sense of affiliation to their native country indeed spend more time on news about politics and current affairs.

Conducting Mann-Whitney-Wilcoxon test

Again, we run the Wilcoxon test to check the results.

wilcox.test(rus2$nwspol ~ rus2$atchctr)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  rus2$nwspol by rus2$atchctr
## W = 270120, p-value = 1.116e-06
## alternative hypothesis: true location shift is not equal to 0

As W = 270120, p-value = 1.116e-06, the results of the non-parametric test again confirmed the results of t-test.

T-test 3

Now we know that the degree of country affiliation is positively associated with the amount of time people spend on the news. But are there any ways to predict the presence of emotional attachment? What if the degree of affiliation is related to the age of the respondent?

Hypotheses

H0: The mean age of attached and non-attached respondents is equal. [the true difference in means is 0]

H1: The mean age of attached and non-attached respondents is different. [the true difference in means is not 0]

Data visualization

ggplot(rus2, aes(x = atchctr, y = agea))+
  geom_boxplot(na.rm = T, color = "black", fill = "lightblue")+
  theme_bw()+
  stat_summary(fun.y = mean, geom = "point",shape = 4,col = "red", na.rm = T)+
  labs(title = "Age of the respondent considering emotional attachment level", x = "Emotional attachment", y = "Age")+
  scale_y_continuous(breaks = 0:20*5)

Here we can see that the mean age differs between people with different degrees of affiliation to the country: attached respondents are older on average than non-attached respondents (about 32 versus about 29, according to the t-test output below).

Checking normality

3.1 Checking normality via histogram:

ggplot(rus2, aes(x = agea))+
  geom_histogram(binwidth = 1,color = "black", na.rm = T, fill = "lightblue")+
  labs(title = "Age of the respondents", x = "Age", y = "Frequency of this age")+
  scale_x_continuous(breaks = 0:20*5)+
  theme_bw()

From the histogram, the distribution looks multimodal.

3.2 Checking normality via qqplot:

qqnorm(rus2$agea); qqline(rus2$agea, col = 2)

From the Q-Q plot it can be seen that the points do not follow the red line, so we conclude that the distribution is indeed non-normal. Still, we can apply the t-test: with a sample this large, the sampling distribution of the mean is approximately normal (central limit theorem).

Conducting t-test

t.test(rus2$agea ~ rus2$atchctr)
## 
##  Welch Two Sample t-test
## 
## data:  rus2$agea by rus2$atchctr
## t = 2.4858, df = 268.62, p-value = 0.01354
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.6465976 5.5720850
## sample estimates:
##     mean in group Attached mean in group Not attached 
##                   32.01432                   28.90498

Conclusion

The t-test shows that respondents who feel attached to the country are older on average (mean age = 32) than those who do not (mean age = 29). The difference is statistically significant: t(268.62) = 2.5 (p-value = 0.01354 < 0.05).

Conducting Mann-Whitney-Wilcoxon test

Let us double-check our results with the Wilcoxon test a third time.

wilcox.test(rus2$agea ~ rus2$atchctr)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  rus2$agea by rus2$atchctr
## W = 264800, p-value = 0.008766
## alternative hypothesis: true location shift is not equal to 0

The Wilcoxon rank sum statistic is W = 264800 (p-value = 0.008766). Thus, the age distributions of attached and non-attached respondents differ in location, which agrees with the t-test conclusion.

Ending

Thanks for your attention!