##Introduction to the topic

In this research, our team examines how important the role of the media is in the life of French society. The question of political activism in France has recently become acute, and with the ongoing “yellow vests” protests, the question of how public opinion is formed remains relevant. We test the hypotheses that our group formed while analyzing a database built from the results of a survey on this topic. For the analysis, we used a number of statistical tests, as well as graphs to visualize the results.

A conclusion is given for each part of the study, and the hypotheses are stated alongside the corresponding tests.

##PROJECT 1 - FIRST DATA VISUALISATION

##Summarized table

knitr::kable(df, caption = 'Table with kable')

Table with kable

| Variable | Description | NOIR | ContinuousOrDiscrete | QualitativeOrQuantitative |
|---|---|---|---|---|
| nwspol | News about politics and current affairs, watching, reading or listening, minutes | Ratio | Continuous | Quantitative |
| netusoft | Internet use, how often | Ordinal | Discrete | Qualitative |
| netustm | Internet use, how much time on typical day, minutes | Ratio | Continuous | Quantitative |
| ppltrst | Most people can be trusted or you can’t be too careful | Interval | Discrete | Quantitative |
| pplfair | Most people try to take advantage of you, or try to be fair | Interval | Discrete | Quantitative |
| gender | Respondent’s gender | Nominal | Discrete | Qualitative |
| age | Respondent’s age | Ratio | Continuous | Quantitative |
| idno | Respondent’s ID | Nominal | Continuous | Quantitative |

##Graph 1: Time spending for political news and current affairs

This graph shows how much time French citizens spent per day on political news and current affairs in 2016.

Hypothesis: Most people follow political news, for example on TV, for no more than 2 hours a day, as the priority sources of information are shifting towards the Internet. We return to this point later when examining the popularity of online sources.

The graph shows that the most frequent values lie between 0 minutes and 1 hour, which is consistent with the hypothesis. A possible explanation is the popularity of entertainment content or the heavy use of the Internet for communication on social networks.

ggplot() + 
  geom_histogram(data = ESS1, aes(x = nwspol), binwidth = 10, fill="grey", col="black", alpha = 0.5) +
  xlim(0, 150) +
  xlab("News about politics and current affairs (min per day)") +
  ylab("Number of people") +
  ggtitle("How much time people spend on political news and current affairs")

## Warning: Removed 97 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

##Graph 2: How the French evaluate others between selfishness and fairness

The graph below shows how the French rate the fairness and self-interest of other people in general.

Hypothesis: The French have an average level of trust in other people (values close to the middle of the scale), because they understand that people interact to obtain practical benefits.

These data are needed to compare people’s trust in those around them with their trust in social media and the information it produces.

The hypothesis was partially confirmed: the most frequent answer is close to the middle of the scale, but answers above 5 and 6 are also frequent.

This may be because the French tend to believe that the society they live in treats them fairly and will not deceive them.

# Drop missing values and the non-response codes (77, 88, 99) for pplfair.
# The codes must be excluded together: a condition joined with "|" is always TRUE.
ESS1_fair <- ESS1 %>%
  filter(!is.na(pplfair), !pplfair %in% c(77, 88, 99))

ggplot() +
  geom_bar(data = ESS1_fair, aes(x = pplfair), col = "black", fill = "gray", alpha = 0.5) +
  xlab("Evaluation of fairness") +
  ylab("Number of people") +
  ggtitle("How the French evaluate others between selfishness and fairness")
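
The non-response codes 77, 88 and 99 have to be excluded with a condition that is false for each of them. A condition joined with `|` is true for every value, because no number equals all three codes at once. The sketch below illustrates this on made-up answers; the vector `answers` is hypothetical and not taken from the survey.

```r
# Hypothetical answers on a 0-10 scale mixed with non-response codes
answers <- c(0, 5, 7, 10, 77, 88, 99)

# Joined with "|", the condition is TRUE for every value, so nothing is dropped
keep_or <- answers != 77 | answers != 88 | answers != 99

# %in% tests membership in the whole set of codes at once
keep_in <- !answers %in% c(77, 88, 99)

print(answers[keep_or])  # all seven values survive
print(answers[keep_in])  # only the valid answers remain
```

The same `%in%` pattern can be used inside `filter()` for any variable with several non-response codes.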

##Graph 3: Gender and time spent on the Internet

The boxplot shows who spends more time on the Internet, men or women. Hypothesis: Men and women spend the same amount of time online every day, since Internet access is very common in developed countries such as France.

In addition, neither leisure time nor information seeking on the Internet is specific to one gender.

This graph is intended to demonstrate that there is no reason to split trust in information by gender or by the time spent on the Internet.

Conclusions from the chart: The median Internet use of men and women is about the same, so gender does not appear to affect the amount of time spent on the Internet per day. Men show greater variation in time, but the median value does not change.

# Plotting time spent on the Internet by gender
ggplot() +
  geom_boxplot(data = ESS1, aes(x = gndr, y = netustm), col = "black", fill = "gray") +
  xlab("Gender") +
  ylab("Time spent on the Internet (min per day)") +
  ggtitle("Distribution of Internet use by gender") +
  theme_bw()

##Graph 4: Correlation between age and time spent on the Internet

Hypothesis: Older people spend less time on the Internet, because they adapt less readily to new technologies and prefer familiar media sources.

To test the hypothesis, let’s construct a scatterplot.

The graph shows that after the age of 60, the time spent on the Internet decreases. The majority of regular Internet users in France are people under 60.

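# Valid answers only: drop the non-response codes for netustm and agea.
# A condition joined with "|" would keep every row, so the codes are excluded together.
ESS1_net <- ESS1 %>%
  filter(!netustm %in% c(6666, 7777, 8888, 9999), agea != 999)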

# Plotting gender variable 
ggplot() + 
geom_boxplot(data = ESS1, aes(x = gndr, y = netustm), col = "black", fill = "gray") + 
xlab("Gender") + 
ylab("Time spending on the Internet (min/1 day)") + 
ggtitle("Distribution of Internet using per gender") + 
theme_bw()
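
The claim about the drop after 60 can also be checked numerically by comparing typical Internet time across age groups. The sketch below runs on simulated data: the data frame `toy`, its values and the age breaks are assumptions made for illustration, not survey results.

```r
set.seed(1)
# Simulated respondents: Internet time falls with age (illustrative only)
toy <- data.frame(agea = sample(18:90, 300, replace = TRUE))
toy$netustm <- pmax(0, round(300 - 3 * toy$agea + rnorm(300, sd = 30)))

# Split age into groups and compare the median time per group
toy$age_group <- cut(toy$agea, breaks = c(17, 39, 59, 90),
                     labels = c("18-39", "40-59", "60+"))
medians <- tapply(toy$netustm, toy$age_group, median)
print(medians)
```

Applied to the cleaned survey data, the same `cut()` and `tapply()` calls would give one median per age group to set beside the scatterplot.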

##Graph 5: Internet use frequency and gender

The bar chart illustrates the relationship between the frequency of Internet use and gender.

Hypothesis: Men and women use the Internet with the same frequency.

This chart provides important information about the frequency of Internet use and shows whether French men or women use the Internet more often. Such information might be useful for further analysis of people’s activities in France.

Conclusions: Our hypothesis is not confirmed. The distribution is slightly unequal: as the graph shows, women in France use the Internet a little more than men.

# Bar plot (Internet and gender)
bar_data = ESS1 %>%
  select(netusoft, gndr) %>%
  group_by(netusoft, gndr) %>%
  summarise(number = n())
bar_data$gndr = factor(bar_data$gndr, levels = c("Female", "Male"))
ggplot(data = bar_data, aes(x = netusoft, y = number, fill = gndr)) +
  geom_bar(stat = "identity") +
  xlab("Internet use, how often") +
  ylab("Number of people") +
  ggtitle("Distribution of Internet use frequency per gender") +
  scale_fill_brewer(palette = "Set1") +
  # label = number is looked up in bar_data, so labels stay aligned within each facet
  geom_text(aes(label = number), color = "black", position = position_dodge(1), vjust = -0.2, size = 4) +
  facet_grid(~gndr) +
  theme(legend.position = "none")

#PROJECT 2 - CHI-SQUARED AND T-TEST

In this part, our team again analyzes survey data on social media use in France. For this we needed a specific dataset, which we worked with throughout. The results of the analysis and their visualization are shown here.

To begin with, we load the database and the libraries whose tools the analysis relies on.

Next, our team created a separate data frame containing only the variables we need.

After checking the database, we identified the following variables, which will be useful to us for analysis:

Next, we cleaned our data of uninformative values (such as “NA”) and transformed some variables.

##Data visualisation

Barplot of born-in-country status by frequency of Internet use

The first thing to look at is whether a person was born in France or immigrated from another country. This allows us to trace the link between Internet use and being born in France or elsewhere. We assume that migrants use the Internet, but far less than native-born people. This may be due to different social or economic indicators.

This chart shows the two groups (native/not native) and their distribution by the frequency of using the Internet to obtain information.

We filter the data and remove the respondents’ answers that contribute nothing to our analysis (answers encoded as 7, 8 and 9); see the data-cleaning step above.

ggplot() +
geom_bar(data = ESSpr2, aes(x = netusoft, fill = brncntr)) +
scale_fill_manual(values = c("deeppink2","yellow3"))+
guides(fill=guide_legend(title="Born in country"))+
theme(axis.text.x = element_text(angle= 20))+
xlab("Internet use, how often") +
ylab("Number of people") +
ggtitle("How often citizens and migrants use the Internet")

Results

The graph shows that the share of migrants in the sample is small compared with the share of native-born respondents. Among non-native respondents, the “every day” option leads all other categories. To draw more precise conclusions we need percentages rather than counts; they are presented in the graph below.

sjPlot package and a more accurate representation in percentage terms

Next, we show the share of each answer to the question about the frequency of Internet use. This graph makes the data easier to interpret than the previous one.

# ESSpr2 is already in the workspace, so no data() call is needed
set_theme(geom.label.angle = 90)
sjp.xtab(ESSpr2$netusoft, ESSpr2$brncntr,
         vjust = "center", hjust = "bottom", show.prc = FALSE,
         geom.colors = c("deeppink2", "yellow3", "blue3"),
         string.total = "Total", axis.titles = "Internet use, how often",
         legend.title = "Born in country",
         title = "How often citizens and migrants use the Internet")

Results

According to the plot, in percentage terms migrants lead among those who use the Internet less than every day, while native-born respondents are more likely to use the Internet every day. Note: The columns in the chart do not reflect the number of respondents in each category, only their percentage share within their category.

##Chi-square test

The assumptions of the chi-square test must be checked first:

1. Raw data are counts, not per cent - YES
2. The sample is large enough for the chi-square test - YES - the expected count in every cell is at least 5
3. Observations are independent - YES - each observation belongs to one group (the groups are independent)
4. The variables are categorical - YES

Having checked all assumptions, we can proceed to the chi-square test.
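
The sample-size assumption can be checked directly from the expected counts. A minimal sketch, using the observed counts from the contingency table printed in this section; the object names `obs` and `expected` are chosen here for illustration.

```r
# Observed counts from the contingency table reported in this section
obs <- matrix(c(363, 113, 90, 161, 1130,
                 36,  16, 19,  28,  113),
              nrow = 2, byrow = TRUE,
              dimnames = list(born = c("Yes", "No"),
                              use = c("Never", "Only occasionally",
                                      "A few times a week", "Most days",
                                      "Every day")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
print(round(expected, 1))
print(min(expected) >= 5)  # the usual rule of thumb for the chi-square test
```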

This chi-square test is used to test the following hypotheses:

H0: The frequency of using the Internet is independent of the “born in country” status.

H1: The frequency of using the Internet depends on the “born in country” status.

table <- table(ESSpr2$brncntr, ESSpr2$netusoft)
kable(table)

|  | Never | Only occasionally | A few times a week | Most days | Every day |
|---|---|---|---|---|---|
| Yes | 363 | 113 | 90 | 161 | 1130 |
| No | 36 | 16 | 19 | 28 | 113 |

table

##      
##       Never Only occasionally A few times a week Most days Every day
##   Yes   363               113                 90       161      1130
##   No     36                16                 19        28       113

c.test <- chisq.test(table)
c.test

## 
##  Pearson's Chi-squared test
## 
## data:  table
## X-squared = 13.514, df = 4, p-value = 0.009018

Results

Since the p-value is below 0.05, the null hypothesis can be rejected: the two variables are statistically dependent. The association between being native-born or foreign-born and the frequency of Internet use is therefore unlikely to be due to chance.
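
A significant result says little about how strong the association is. One common effect-size measure is Cramér's V. It is not part of the original analysis and is added here only as an illustration, recomputed from the counts in the table above.

```r
# Observed counts (rows: born in country Yes / No)
obs <- matrix(c(363, 113, 90, 161, 1130,
                 36,  16, 19,  28,  113), nrow = 2, byrow = TRUE)

res <- chisq.test(obs)
# Cramer's V = sqrt(chi-squared / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(unname(res$statistic) / (sum(obs) * (min(dim(obs)) - 1)))
print(round(cramers_v, 3))
```

A value this close to 0 points to a weak association, even though it is statistically significant.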

##Analysis of residuals

A relation between born-in-country status and internet use

expected<-matrix(c.test$expected, nrow = 2)
colnames(expected)<-c("Never", "Only occasionally", "A few times a week", "Most days", "Every day")
rownames(expected)<-c("Yes", "No")

observed<-matrix(c.test$observed, nrow = 2)
colnames(observed)<-c("Never", "Only occasionally", "A few times a week", "Most days", "Every day")
rownames(observed)<-c("Yes", "No")


kable(c.test$stdres, caption = "Residuals")

Residuals
|  | Never | Only occasionally | A few times a week | Most days | Every day |
|---|---|---|---|---|---|
| Yes | 0.8973358 | -0.8341256 | -2.541327 | -2.172577 | 2.126253 |
| No | -0.8973358 | 0.8341256 | 2.541327 | 2.172577 | -2.126253 |
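
To make the residuals table easier to read, the sketch below recomputes the standardized residuals by hand from the observed counts and compares them with the `stdres` component of `chisq.test()`. The object names are illustrative.

```r
obs <- matrix(c(363, 113, 90, 161, 1130,
                 36,  16, 19,  28,  113), nrow = 2, byrow = TRUE)
n <- sum(obs)
expected <- outer(rowSums(obs), colSums(obs)) / n

# Standardized residual: (O - E) / sqrt(E * (1 - row share) * (1 - column share))
denom <- sqrt(expected * outer(1 - rowSums(obs) / n, 1 - colSums(obs) / n))
stdres_manual <- (obs - expected) / denom
print(round(stdres_manual, 3))

# Same values as the stdres component returned by chisq.test()
print(all.equal(stdres_manual, chisq.test(obs)$stdres, check.attributes = FALSE))
```

Absolute values above roughly 2 mark the cells that deviate most from what independence would predict.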

assocplot(t(table), col = c("deeppink2","yellow3"), xlab = "Frequency of Internet use", ylab="Born in country", main = "Relation between born-in-country status and internet use")

For those born in the country, observed values are lower than expected for the middle categories (only occasionally, a few times a week and most days) and higher for the extreme categories (never and every day).

For those who were not born in France, the relation between observed and expected values is the opposite.

corrplot(c.test$stdres, is.cor=FALSE, main = "Residuals")

Blue cells indicate a positive relation between born-in-country status and Internet use; red cells indicate a negative relation. The residuals furthest from 0 are observed among those who use the Internet a few times a week, while those for “only occasionally” are the closest to 0.

Results

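Since the p-value of the chi-square test is below 0.05 (p = 0.009018), the null hypothesis of independence is rejected. The standardized residuals show which cells drive this result: their absolute values exceed 2 for “A few times a week”, “Most days” and “Every day”.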

##T-TEST

Boxplot connected with the t-test (variables: gender and nwspol)

Here is a boxplot of the time spent on political news by gender. It shows the median and the interquartile range of the time men and women spend watching political news.

ggplot(ESSpr2, aes(x = gndr, y = nwspol, fill = gndr)) + 
geom_boxplot() +
scale_fill_manual(values=c("deeppink2", "yellow3"))+
xlab("Gender") +
ylab("Viewing of political news") +
guides(fill = guide_legend(title = "Gender")) +
ggtitle("Gender and interest in political news") +
theme_bw()

## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

Results

Based on the resulting graph, the maximum, the minimum, the interquartile range and the median are at about the same level for both groups. To check whether the difference between the two gender groups is statistically significant, we use a t-test. Before that, we need to check normality.

Checking normality

Average time people spend on political news and current affairs (2)

To show how the time spent on political news is distributed, we used a histogram that groups neighbouring values into one column.

Also, response types that give us no information (such as “NA”) have been removed.

ESSpr2$nwspol <- as.numeric(ESSpr2$nwspol)
ggplot() + 
geom_histogram(data = ESSpr2, aes(x = nwspol), binwidth = 4, fill="deeppink2", col="black", alpha = 0.9) +
xlab("News about politics and current affairs (min per day)") +
ylab("Number of people") +
ggtitle("How much time people spend on political news and current affairs")

## Warning: Removed 4 rows containing non-finite values (stat_bin).

Results

The chart shows that most people in France spend between 0 and about 26 minutes a day on political news. This suggests that people in France do not spend a lot of time watching political news.

Checking normality in 2 ways

QQ - plot (1st way)

These Q-Q plots were built to test normality. To make sure our later conclusions are reliable, normality should be checked in more than one way; the first is the Q-Q plot. To check the two groups we are interested in (men and women), a separate plot is needed for each. Below is the corresponding code.

female <- subset(ESSpr2, ESSpr2$gndr == "Female")
male <- subset(ESSpr2, ESSpr2$gndr == "Male")
# col = 2 belongs to qqline(), not to as.numeric()
qqnorm(as.numeric(female$nwspol)); qqline(as.numeric(female$nwspol), col = 2)

qqnorm(as.numeric(male$nwspol)); qqline(as.numeric(male$nwspol), col = 2)

Results

In both graphs, the points follow the reference line closely enough to treat the data for both females and males as approximately normally distributed. Next, we check the second assumption of the t-test with a variance test.

Variance test (2nd check)

Next, the variances should be evaluated: the t-test assumes that they are equal in the two groups. The F test is designed to verify the homogeneity of variance. It is stated in terms of a null and an alternative hypothesis.

H0: The variances of time spent on political news do not differ between genders.

H1: The variances of time spent on political news differ between genders.

var.test(ESSpr2$nwspol ~ ESSpr2$gndr)

## 
##  F test to compare two variances
## 
## data:  ESSpr2$nwspol by ESSpr2$gndr
## F = 0.98549, num df = 950, denom df = 1114, p-value = 0.8163
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8721483 1.1142250
## sample estimates:
## ratio of variances 
##           0.985486

Results

With a p-value of 0.8163 we cannot reject the null hypothesis. This means that the variances of time spent on political news do not differ significantly between genders. We have two options for follow-up testing: the Wilcoxon test and the t-test. Since the assumptions hold, we can use the t-test, which is the more powerful of the two when its assumptions are met.
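
The F statistic reported by `var.test()` is simply the ratio of the two sample variances. The sketch below shows this on simulated data; the vectors `x` and `y` are made up for illustration and are not survey values.

```r
set.seed(42)
# Simulated minutes of news for two groups (illustrative, not survey data)
x <- rnorm(50, mean = 12, sd = 7)
y <- rnorm(60, mean = 12, sd = 7)

f_manual <- var(x) / var(y)   # F statistic: ratio of the sample variances
f_test <- var.test(x, y)
print(c(manual = f_manual, var.test = unname(f_test$statistic)))
print(f_test$parameter)       # degrees of freedom: n1 - 1 and n2 - 1
```

A ratio close to 1, as in the output above (0.985), is what equal variances look like.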

T-test

Before doing the t-test, two assumptions must hold: the variances of the two samples are equal and the data are normally distributed. Both assumptions were checked above, so we can proceed to the t-test. In this test, we compare the mean time devoted to political news between men and women.

Based on the previous graphs, we put forward the following hypotheses:

H0: The difference between males and females in relation to time spent on political news is not statistically significant.

H1: The difference between males and females in relation to time spent on political news is statistically significant.

These hypotheses were advanced on the assumption that women in Europe are presumably less politically active than men, which can be seen as a result of gender stereotypes and the gender gap in all areas of socialization.

ESSpr2$nwspol <- as.numeric(ESSpr2$nwspol)
t.test(ESSpr2$nwspol ~ ESSpr2$gndr, var.equal = T)

## 
##  Two Sample t-test
## 
## data:  ESSpr2$nwspol by ESSpr2$gndr
## t = 1.6351, df = 2064, p-value = 0.1022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1058539  1.1678109
## sample estimates:
##   mean in group Male mean in group Female 
##             12.18927             11.65830

Results

P-value is greater than 0.05. That is why we do not reject the null hypothesis.
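
For reference, the equal-variance t statistic can be reproduced by hand from the group means and a pooled variance. The sketch uses simulated groups; `male` and `female` here are made-up vectors, not the survey data.

```r
set.seed(7)
male <- rnorm(40, mean = 12.2, sd = 7)    # simulated values
female <- rnorm(45, mean = 11.7, sd = 7)  # simulated values

n1 <- length(male); n2 <- length(female)
# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(male) + (n2 - 1) * var(female)) / (n1 + n2 - 2)
t_manual <- (mean(male) - mean(female)) / sqrt(sp2 * (1 / n1 + 1 / n2))

t_builtin <- t.test(male, female, var.equal = TRUE)
print(c(manual = t_manual, t.test = unname(t_builtin$statistic)))
```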

To be sure that this result is robust, we run an additional test (the Wilcoxon test).

Wilcoxon test (non-parametric alternative to the t-test)

As mentioned before, we want to be sure that the t-test results are correct, so we run a second test, the Wilcoxon test. We put forward the same kind of hypotheses as in the t-test, stated for the location of the distributions rather than the means.

H0: The distributions of time spent on political news do not differ in location between males and females.

H1: The distributions of time spent on political news differ in location between males and females.

wilcox.test(ESSpr2$nwspol ~ ESSpr2$gndr)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ESSpr2$nwspol by ESSpr2$gndr
## W = 553900, p-value = 0.07557
## alternative hypothesis: true location shift is not equal to 0

Results

Here, too, the p-value is greater than 0.05, so we cannot reject the null hypothesis, in line with the t-test.
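
The W statistic of the rank sum test has a simple reading: it counts the pairs in which a value from the first group exceeds a value from the second. A minimal sketch on made-up numbers without ties (the vectors `x` and `y` are hypothetical):

```r
x <- c(5, 12, 20, 31, 45)      # hypothetical minutes, group 1
y <- c(3, 10, 15, 18, 25, 60)  # hypothetical minutes, group 2

# W counts the pairs (x_i, y_j) in which the value from the first group is larger
w_manual <- sum(outer(x, y, ">"))
w_builtin <- wilcox.test(x, y)$statistic
print(c(manual = w_manual, wilcox.test = unname(w_builtin)))
```

Because the test works on ranks, it does not rely on the normality assumption of the t-test.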

This analysis did not show a statistically significant gap between men and women in the time spent on political news. The small difference in means can still serve as a basis for theoretical hypotheses; our assumption is that it is due to prejudices associated with gender.

##Distribution of tasks within the team

Maria and Alexander worked on the t-test, Linara and Yuri on the chi-squared test. We rechecked each other’s code and agreed on the direction of the analysis in a separate meeting.

##PROJECT 3 - ANOVA

First of all, we loaded the databases and libraries. After checking the database, we identified the following variables, which will be useful to us for analysis. Next, we cleaned our data of uninformative values and transformed the ppltrst variable.

##Variables

For this analysis we took the following variables:

nwspol - how much time people spent on political news (in minutes). This is a numeric (ratio) variable.

ppltrst - whether people think they can trust others or should be careful. We divided this variable into 4 categories to use in our analysis: No trust, More likely not to trust, More likely to trust, Trusting.
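
The recoding into four categories can be done with `cut()`. The cut points used in the original analysis are not shown, so the breaks below are an assumption made purely for illustration, as is the short vector `ppltrst`.

```r
# Hypothetical cut points on the 0-10 ppltrst scale (the originals are not shown)
ppltrst <- c(0, 2, 3, 5, 6, 8, 9, 10)
trust <- cut(ppltrst,
             breaks = c(-1, 2, 5, 8, 10),
             labels = c("No trust", "More likely not to trust",
                        "More likely to trust", "Trusting"))
print(table(trust))
```

With these breaks, each interval includes its upper bound, so 0-2 falls into “No trust” and 9-10 into “Trusting”.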

##Data visualisation

We use a boxplot to look at our grouped data. It shows the “trust” groups and their distribution according to the amount of time people spent on political news.

ggplot() +
  geom_boxplot(data = ESSpr3, aes(x = trust, y = nwspol), col = "black", fill = "gray") +
  xlab("Trust in people") +
  ylab("Time spent on political news (min per day)") +
  ggtitle("Time spent on political news by level of trust") +
  theme_bw()

Based on the resulting graph, the maximum value, the minimum value (except in the No trust group), the interquartile range and the median are at about the same level. To check the difference in means we will use the oneway test.

##Checking the assumptions

Before applying the oneway test we check its assumptions:

1. Independence of observations (follows from the source of the data)
2. Normal distribution of residuals (checked after the analysis)
3. Equal variances (checked by Levene’s test)

Levene’s Test

H0: The variance is equal across our four “trust” groups.

H1: The variance differs across our four “trust” groups.

leveneTest(ESSpr3$nwspol ~ ESSpr3$trust)

Results

From the output above we can see that the p-value = 0.4129 that is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. We can indicate in the ANOVA test that var.equal = T. Yay!

##Oneway test

The oneway test is used to check the difference in means between the 4 “trust” groups. Applying this test we will find out whether the average time spent on political news differs among the “trust” groups.

###Hypothesis for oneway test

We assume that people who trust and people who distrust others differ in the amount of time they devote to news about politics.

H0: There is no difference in means between the groups in time spent on political news.

H1: There is a difference in means for at least one pair of groups in time spent on political news.

oneway.test(ESSpr3$nwspol ~ ESSpr3$trust, var.equal = T)

## 
##  One-way analysis of means
## 
## data:  ESSpr3$nwspol and ESSpr3$trust
## F = 6.6392, num df = 3, denom df = 2062, p-value = 0.0001844

The output shows that the p-value is much lower than the 0.05 significance level, so we have evidence to reject the null hypothesis. The F-ratio of 6.6392 is well above 1, meaning that the variation among the group means is large relative to the variation within the groups; together with the p-value, this indicates that the group means differ significantly. Yahoo!

###Outliers

The next function also helps to find out whether there is any significant difference between the average time spent on political news in the 4 “trust” groups.

aov.out <- aov(ESSpr3$nwspol ~ ESSpr3$trust) 
summary(aov.out)

##                Df Sum Sq Mean Sq F value   Pr(>F)    
## ESSpr3$trust    3   1070   356.7   6.639 0.000184 ***
## Residuals    2062 110781    53.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This output confirms the results of the oneway test.
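
The link between the two outputs can be verified from the ANOVA table itself: the F value is the ratio of the two mean squares. A short sketch using the sums of squares and degrees of freedom printed above (variable names are illustrative):

```r
# Values taken from the ANOVA table above
ss_between <- 1070;   df_between <- 3
ss_within  <- 110781; df_within  <- 2062

ms_between <- ss_between / df_between  # mean square between groups
ms_within  <- ss_within / df_within    # mean square within groups (residual)
f_ratio <- ms_between / ms_within
p_value <- pf(f_ratio, df_between, df_within, lower.tail = FALSE)
print(c(F = f_ratio, p = p_value))
```

Small differences from the printed F value come from the rounding of the sums of squares in the table.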

###Checking the normality of residuals

Now we check the normality of the residuals (an ANOVA assumption that is easier to check after the test).

layout(matrix(1:4, 2, 2))
plot(aov.out)

These graphs show that the residuals are distributed almost normally, but there are small deviations, which are most noticeable on the Q-Q plot.

Then we conduct a second check of the normality of the residuals. We want to make sure that skew (i.e. the degree of asymmetry) and kurtosis (i.e. the heaviness of the tails, which reflects outliers) take on values that satisfy us.

anova.res <- residuals(object = aov.out)
describe(anova.res)

As can be seen, skew is less than 2 and kurtosis is less than 2. By this rule of thumb, the residuals can be treated as approximately normally distributed.

The next step is the Shapiro-Wilk test!

##Shapiro-Wilk test

H0: the residuals are normally distributed

H1: the residuals are not normally distributed

shapiro.test(x = anova.res) # if the p-value is > .05, the distribution IS normal.

## 
##  Shapiro-Wilk normality test
## 
## data:  anova.res
## W = 0.94801, p-value < 2.2e-16

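The p-value is much lower than 0.05. The null hypothesis is rejected: there is evidence that the residuals are not normally distributed. With a sample of this size, the Shapiro-Wilk test detects even small deviations from normality, so this result is read together with the graphical checks above.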

For clarity, we visualize our results using a histogram.

hist(anova.res)

##Tukey’s test

ANOVA does not show which groups differ, so we need to compare them in pairs. We conduct Tukey’s honest significant difference (HSD) test and then visualize the results.

We start with the hypotheses!

H0: the two means are equal

H1: the two means differ

TukeyHSD(aov.out)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ESSpr3$nwspol ~ ESSpr3$trust)
## 
## $`ESSpr3$trust`
##                                                      diff        lwr
## More likely  to trust-More likely not to trust  0.4543531 -0.6408543
## No trust-More likely not to trust              -1.3094862 -2.3186146
## Trusting-More likely not to trust               0.5663720 -1.0047885
## No trust-More likely  to trust                 -1.7638393 -2.9313320
## Trusting-More likely  to trust                  0.1120190 -1.5652540
## Trusting-No trust                               1.8758582  0.2534817
##                                                       upr     p adj
## More likely  to trust-More likely not to trust  1.5495604 0.7099237
## No trust-More likely not to trust              -0.3003578 0.0047873
## Trusting-More likely not to trust               2.1375326 0.7904686
## No trust-More likely  to trust                 -0.5963465 0.0006117
## Trusting-More likely  to trust                  1.7892920 0.9982007
## Trusting-No trust                               3.4982348 0.0157871

We observe three pairs with a p-value less than 0.05:

1. No trust - More likely not to trust
2. No trust - More likely to trust
3. Trusting - No trust

This means that the “No trust” group differs significantly from each of the other three groups, while the other groups do not differ significantly from one another.

par(mar = c(5, 18, 3, 1))
Tukey <- TukeyHSD(aov.out)
plot(Tukey, las = 2)

The graph confirms the text above.

##Kruskal test

Next, we take a look at the non-parametric counterpart of ANOVA, i.e. the Kruskal-Wallis test.

H0: the mean ranks of the groups are the same.

H1: the mean rank of at least one group differs from the others.

kruskal.test(nwspol ~ trust, data = ESSpr3)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  nwspol by trust
## Kruskal-Wallis chi-squared = 22.036, df = 3, p-value = 6.412e-05

Test shows that p-value is less than 0.05. Null hypothesis is rejected. There is a difference in mean ranks at least in one pair. It should be emphasized that ANOVA showed same result.

##Dunn test

Based on the Kruskal-Wallis test results, we conduct Dunn’s test (i.e. a non-parametric post hoc test).

library(DescTools)
DunnTest(nwspol ~ trust, data = ESSpr3)

## 
##  Dunn's test of multiple comparisons using rank sums : holm  
## 
##                                                mean.rank.diff    pval    
## More likely  to trust-More likely not to trust       42.04527 0.65801    
## No trust-More likely not to trust                  -106.21699 0.00379 ** 
## Trusting-More likely not to trust                    59.49364 0.65801    
## No trust-More likely  to trust                     -148.26225 0.00029 ***
## Trusting-More likely  to trust                       17.44838 0.73926    
## Trusting-No trust                                   165.71063 0.00433 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than 0.005 for three pairs (Hooray!):

1) No trust - More likely not to trust
2) No trust - More likely to trust
3) Trusting - No trust

This shows that these categories differ from one another and that the differences are statistically significant.
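
The Dunn output above is labelled “holm”, which refers to the Holm correction for multiple comparisons. How the correction works can be seen with base R's `p.adjust()`; the raw p-values below are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical unadjusted p-values for three pairwise comparisons
p_raw <- c(0.01, 0.02, 0.03)

# Holm: multiply the smallest p by m, the next by m - 1, and so on,
# then make sure the adjusted values never decrease
p_holm <- p.adjust(p_raw, method = "holm")
print(p_holm)
```

Here the smallest value is multiplied by 3 and the second by 2; the third would stay at 0.03 but is raised to 0.04 so that the adjusted values remain ordered.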

##Conclusion

We conducted a statistical analysis using the ANOVA test. The results show that there is a statistically significant difference in viewing political news between groups categorized by their level of trust in people. The group of respondents who tend not to trust the people around them (the group labelled “No trust”) spends less time watching political news.

#Project №4

##Introduction

Politics today attracts a lot of public attention, and many people focus their interest on it. The topic combines elements of show and intrigue with a direct bearing on people’s real lives. Our work is connected with the analysis of politics in the media space. Political news is one of the most popular sections of television and other media. But does it just play in the background, or do people follow it consciously? Which people tend to spend more time watching political news? These issues are highly relevant in the context of modern sociology.


Research question and hypotheses

Our research team is interested in determining the factors that influence the time spent on the news. We pose the following research question: Do a person’s age and level of interest in politics influence the amount of time he/she spends on political news? Based on the research question and the variables available to us, we put forward two research hypotheses.

1. Interest in politics affects the amount of time a person spends on political news.
2. A person’s age affects the amount of time he/she spends on political news.

#Variables

We used 2 predictor variables in our model:

1) Respondent’s age in years (agea)
2) Respondent’s level of interest in politics (polintr)

The dependent variable is time spent on political news in minutes (nwspol).

#Descriptive statistics

The interest in politics variable has 4 categories, where 1 is “Very interested” and 4 is “Not interested at all”. We visualize the distribution of this variable using a bar chart.

ggplot(data = ESSpr4, aes(x = polintr)) +
  geom_bar() +
  xlab("Interest in politics") +
  ylab("Number of respondents") +
  ggtitle("How interested people are in politics")

On the graph we can see that the distribution looks roughly normal in shape, as far as that can be said of a variable with only four ordered categories.

Visualize in a similar way the following variable - age.

ggplot(data = ESSpr4, aes(x = agea)) +
  geom_histogram(binwidth = 4) +  # binwidth belongs to the geom, not to ggplot()
  xlab("Age") +
  ylab("Number of respondents") +
  ggtitle("Distribution of age of respondents")

The distribution looks approximately normal.

Finally, we turn to our dependent variable. It has a peculiarity that comes from the chosen metric: respondents were asked to state the time spent in minutes, and they tended to report multiples of 5 or 10. For example, many will say that they watch for 20 minutes a day, not 21. As a result, the distribution is slightly skewed, but we still treat it as approximately normal.

ggplot(data = ESSpr4, aes(x = nwspol)) +
  geom_histogram() +
  xlab("Time respondent spent on political news, in minutes") +
  ylab("Number of respondents") +
  ggtitle("Distribution of time respondents spent on political news, in minutes")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
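The tendency to report round numbers, described above, can be checked directly. The sketch below uses a small hypothetical vector `minutes` (invented for illustration, not the ESS data) and computes the share of answers that are exact multiples of 5.

```r
# Hypothetical reported minutes (illustrative only, not taken from ESSpr4)
minutes <- c(20, 30, 15, 60, 21, 45, 10, 90, 7, 30)

# Share of answers that are exact multiples of 5
share_multiple_of_5 <- mean(minutes %% 5 == 0)
print(share_multiple_of_5)
```

Applied to the real variable, the same expression would use `ESSpr4$nwspol` in place of `minutes`; a share close to 1 would indicate strong rounding.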

The relationship between respondents’ age and the time they spend on political news is presented in the graph below.

ggplot(ESSpr4, aes(x = agea, y = nwspol)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("Respondent's age") +
  ylab("Amount of time respondent spent on political news, in minutes") +
  ggtitle("Time respondents spent on political news by age, in minutes")

The blue line shows a positive relationship between age and time spent on political news: the older the respondent, the more time he/she spends on political news.

We test the correlation between these two variables. H0: the correlation is equal to 0. H1: the correlation is not equal to 0.

cor.test(ESSpr4$agea, ESSpr4$nwspol)

## 
##  Pearson's product-moment correlation
## 
## data:  ESSpr4$agea and ESSpr4$nwspol
## t = 15.937, df = 2062, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2921876 0.3690318
## sample estimates:
##       cor 
## 0.3311587

The output shows that the p-value is smaller than 2.2e-16, so we reject H0. The test shows a positive correlation of moderate strength, r = 0.33.
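The test statistic can be reproduced from the reported correlation. The sketch below uses the standard relation t = r * sqrt(df) / sqrt(1 - r^2), with `r` and `df` copied from the output above; the variable names are chosen for illustration.

```r
# Values copied from the cor.test() output
r  <- 0.3311587
df <- 2062

# t statistic for testing H0: correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# Squared correlation: share of variance shared by the two variables
r_squared <- r^2

print(c(t = t_stat, r_squared = r_squared))
print(p_value)
```

The squared correlation is about 0.11, the same as the R-squared of the simple regression of nwspol on agea reported later.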

Moving to another couple of variables, we produce the same analysis.

ggplot(ESSpr4, aes(x = int, y = nwspol)) +
 geom_boxplot() +
 labs(x = "Interest in politics", y = "Time respondent spent on political news, in minutes", title = "Amount of time respondent spent on political news depending on interest in politics")

The graph shows that people who are more interested in politics spend more time on political news, and vice versa.

The correlation gives the same result. It is negative (r = -0.35) because higher values of polintr mean less interest in politics.

cor(ESSpr4$polintr, ESSpr4$nwspol, method = "pearson")

## [1] -0.3539419

#Model creating

###First model

model1 <- lm(nwspol ~ polintr, data = ESSpr4) 
summary(model1)

## 
## Call:
## lm(formula = nwspol ~ polintr, data = ESSpr4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.091  -4.796  -1.404   3.969  32.969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.7775     0.4277   43.90   <2e-16 ***
## polintr      -2.6865     0.1563  -17.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.888 on 2062 degrees of freedom
## Multiple R-squared:  0.1253, Adjusted R-squared:  0.1249 
## F-statistic: 295.3 on 1 and 2062 DF,  p-value: < 2.2e-16

tab_model(model1)

	nwspol
Predictors	Estimates	CI	p
(Intercept)	18.78	17.94 – 19.62	<0.001
polintr	-2.69	-2.99 – -2.38	<0.001
Observations	2064
R² / adjusted R²	0.125 / 0.125

We now interpret the table. The intercept of 18.78 is the predicted time spent on political news per day for polintr = 0, a value outside the 1 to 4 scale, so it has no direct interpretation. Each one-step move towards less interest lowers the predicted time by 2.69 minutes, so a respondent who is very interested in politics (polintr = 1) is predicted to spend about 16.09 minutes. P-values below 0.001 allow us to state that the coefficients differ from 0. Multiple R-squared is 0.125, which means that the model explains about 13 per cent of the variance in time spent on political news.
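To make the coefficients concrete, the following sketch computes the predicted time for each of the four levels of interest, using the estimates copied from the summary of model1. The object names are illustrative.

```r
# Estimates copied from summary(model1)
intercept <- 18.7775
slope     <- -2.6865

# polintr scale: 1 = very interested ... 4 = not at all interested
polintr_levels <- 1:4

# Predicted minutes on political news for each level
predicted <- intercept + slope * polintr_levels
names(predicted) <- paste0("polintr=", polintr_levels)
print(round(predicted, 2))
```

With the fitted object available, `predict(model1, newdata = data.frame(polintr = 1:4))` should give the same values up to rounding.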

###Second model

Now we check the second model, which is based on age.

model2 <- lm(nwspol ~ agea, data = ESSpr4) 
summary(model2)

## 
## Call:
## lm(formula = nwspol ~ agea, data = ESSpr4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.883  -4.993  -1.144   4.270  33.594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.956087   0.346158   20.09   <2e-16 ***
## agea        0.128917   0.008089   15.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.949 on 2062 degrees of freedom
## Multiple R-squared:  0.1097, Adjusted R-squared:  0.1092 
## F-statistic:   254 on 1 and 2062 DF,  p-value: < 2.2e-16

tab_model(model2)

	nwspol
Predictors	Estimates	CI	p
(Intercept)	6.96	6.28 – 7.63	<0.001
agea	0.13	0.11 – 0.14	<0.001
Observations	2064
R² / adjusted R²	0.110 / 0.109

In this model, each additional year of age is associated with an increase of 0.13 minutes in the daily time spent on political news. The model explains about 11 per cent of the variance.

###Third model

This model includes all previous variables.

model3 <- lm(nwspol ~ polintr + agea, data = ESSpr4) 
summary(model3)

## 
## Call:
## lm(formula = nwspol ~ polintr + agea, data = ESSpr4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.320  -4.575  -1.024   3.661  34.828 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.75978    0.51352   26.80   <2e-16 ***
## polintr     -2.52774    0.14800  -17.08   <2e-16 ***
## agea         0.12013    0.00759   15.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.505 on 2061 degrees of freedom
## Multiple R-squared:  0.2201, Adjusted R-squared:  0.2193 
## F-statistic: 290.8 on 2 and 2061 DF,  p-value: < 2.2e-16

tab_model(model3)

	nwspol
Predictors	Estimates	CI	p
(Intercept)	13.76	12.75 – 14.77	<0.001
polintr	-2.53	-2.82 – -2.24	<0.001
agea	0.12	0.11 – 0.14	<0.001
Observations	2064
R² / adjusted R²	0.220 / 0.219

This is the best predictive model in our analysis: it explains about 22 per cent of the variance in time spent on political news.
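As an illustration of how the two predictors combine, the sketch below computes predictions from the model3 estimates for two hypothetical respondents. The profiles (ages 60 and 20) are assumptions chosen for the example, not cases from the data.

```r
# Estimates copied from summary(model3)
b0        <- 13.75978
b_polintr <- -2.52774
b_agea    <- 0.12013

# Helper for the fitted equation nwspol = b0 + b_polintr * polintr + b_agea * agea
predict_nwspol <- function(polintr, agea) {
  b0 + b_polintr * polintr + b_agea * agea
}

# Hypothetical profiles
very_interested_60 <- predict_nwspol(polintr = 1, agea = 60)
not_interested_20  <- predict_nwspol(polintr = 4, agea = 20)

print(c(very_interested_60 = very_interested_60,
        not_interested_20  = not_interested_20))
```

The gap between the two profiles shows that both predictors contribute to the prediction in the same direction here: older and more interested respondents are predicted to spend more time.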

##Residuals and normality

Now we visualize the diagnostics of our models: the distribution of the residuals and how they relate to the fitted values.

plot(model1)

The line on the residuals graph is almost straight and close to zero, which means that the model fits reasonably well. Let’s go further.

plot(model2)

For this model the residuals look approximately normally distributed. The red line on the graphs is also close to zero, which means that this model is good, but let’s look for a better one.

plot(model3)

The Q-Q plot and the residuals graph show that our full model is as good as the previous ones. These results allow us to take the next step: comparing the models.

#ANOVA tests comparing the single-predictor models with the full model

On these plots we can see that the line in the “Residuals vs Fitted” graph is almost straight, which suggests that the linear specification is adequate.

However, we must also compare the models with each other using an ANOVA test.

H0: the reduced model is as good as the full model. H1: the reduced model is worse than the full model.

anova(model1, model3)

anova(model2, model3)

In both cases the p-value is less than 0.05. This means that we can reject the null hypothesis that the reduced model is as good as the full one, and conclude that the addition of a predictor significantly improves the model.
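The comparison of nested models can be approximated from the R-squared values reported in the summaries, because all three models share the same outcome and the same 2064 observations. The sketch below uses the formula F = ((R2_full - R2_reduced) / q) / ((1 - R2_full) / df_full); since the inputs are rounded, the resulting F values are approximate and are not a substitute for the `anova()` output. The helper name `f_from_r2` is introduced here for illustration.

```r
# R-squared values copied from the summary() outputs (rounded)
r2_model1 <- 0.1253    # nwspol ~ polintr
r2_model2 <- 0.1097    # nwspol ~ agea
r2_model3 <- 0.2201    # nwspol ~ polintr + agea
df_full   <- 2061      # residual degrees of freedom of model3
q         <- 1         # predictors added when moving to the full model

# F statistic for a nested-model comparison, computed from R-squared
f_from_r2 <- function(r2_reduced, r2_full, q, df_full) {
  ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_full)
}

f_vs_model1 <- f_from_r2(r2_model1, r2_model3, q, df_full)
f_vs_model2 <- f_from_r2(r2_model2, r2_model3, q, df_full)

# Upper-tail p-values of the F distribution
p_vs_model1 <- pf(f_vs_model1, q, df_full, lower.tail = FALSE)
p_vs_model2 <- pf(f_vs_model2, q, df_full, lower.tail = FALSE)

print(round(c(f_vs_model1 = f_vs_model1, f_vs_model2 = f_vs_model2), 1))
print(c(p_vs_model1 = p_vs_model1, p_vs_model2 = p_vs_model2))
```

When a single predictor is added, this F is the square of that predictor's t value in the full model, which is why the two results are close to 15.83^2 and 17.08^2.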

#Next step

Now we want to compare the standardized regression coefficients for each variable and visualize them.

library(QuantPsyc)
lm.beta(model3)

##    polintr       agea 
## -0.3330224  0.3085824

##visualizing
library(sjPlot)
plot_model(model3, type = "std")

#Centering of model for multicollinearity

We center the predictors so that the intercept becomes meaningful: after centering it is the predicted value for a respondent with average age and average interest in politics. We then fit the models with the centered predictors.

ESSpr4$agea_centered <- ESSpr4$agea - mean(ESSpr4$agea)
ESSpr4$polintr_centered <- ESSpr4$polintr - mean(ESSpr4$polintr)
model4 <- lm(ESSpr4$nwspol ~ ESSpr4$agea_centered)
model5 <- lm(ESSpr4$nwspol ~ ESSpr4$polintr_centered)
model6 <- lm(ESSpr4$nwspol ~ ESSpr4$polintr_centered + ESSpr4$agea_centered)
summary(model4)

## 
## Call:
## lm(formula = ESSpr4$nwspol ~ ESSpr4$agea_centered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.883  -4.993  -1.144   4.270  33.594 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          11.905039   0.152951   77.84   <2e-16 ***
## ESSpr4$agea_centered  0.128917   0.008089   15.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.949 on 2062 degrees of freedom
## Multiple R-squared:  0.1097, Adjusted R-squared:  0.1092 
## F-statistic:   254 on 1 and 2062 DF,  p-value: < 2.2e-16

summary(model5)

## 
## Call:
## lm(formula = ESSpr4$nwspol ~ ESSpr4$polintr_centered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.091  -4.796  -1.404   3.969  32.969 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              11.9050     0.1516   78.53   <2e-16 ***
## ESSpr4$polintr_centered  -2.6865     0.1563  -17.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.888 on 2062 degrees of freedom
## Multiple R-squared:  0.1253, Adjusted R-squared:  0.1249 
## F-statistic: 295.3 on 1 and 2062 DF,  p-value: < 2.2e-16

summary(model6)

## 
## Call:
## lm(formula = ESSpr4$nwspol ~ ESSpr4$polintr_centered + ESSpr4$agea_centered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.320  -4.575  -1.024   3.661  34.828 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             11.90504    0.14319   83.14   <2e-16 ***
## ESSpr4$polintr_centered -2.52774    0.14800  -17.08   <2e-16 ***
## ESSpr4$agea_centered     0.12013    0.00759   15.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.505 on 2061 degrees of freedom
## Multiple R-squared:  0.2201, Adjusted R-squared:  0.2193 
## F-statistic: 290.8 on 2 and 2061 DF,  p-value: < 2.2e-16
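The outputs above show the same slopes as the uncentered models, with only the intercept changing to 11.905. A minimal sketch on simulated data (the variables `age` and `news` are invented for illustration and are not the ESS data) shows why: centering a predictor shifts the intercept to the mean of the outcome and leaves the slope untouched.

```r
# Simulated data, illustrative only
set.seed(42)
age  <- round(runif(300, min = 18, max = 90))
news <- 7 + 0.13 * age + rnorm(300, sd = 7)

# Mean-center the predictor
age_c <- age - mean(age)

fit_raw      <- lm(news ~ age)
fit_centered <- lm(news ~ age_c)

# Slopes are identical; the centered intercept equals the mean of the outcome
print(coef(fit_raw))
print(coef(fit_centered))
print(mean(news))
```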

Let’s check the quality of new models by ANOVA and plots.

plot(model4)

plot(model5)

plot(model6)

All these models are as good as in the previous case. Let’s compare them with the ANOVA test.

As before, the hypotheses are: H0: the reduced model is as good as the full model. H1: the reduced model is worse than the full model.

anova(model4, model6)

anova(model5, model6)

The results of the tests can be interpreted in the same way as in the first case: we reject H0.

#Comparing the effects within one model

The next function lets us compare the effects of the predictors on the outcome. The standardized coefficients show that interest in politics has a slightly stronger effect (-0.33) than age (0.31) on the number of minutes respondents spend on political news.

lm.beta(model6)

## ESSpr4$polintr_centered    ESSpr4$agea_centered 
##              -0.3330224               0.3085824
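A standardized coefficient is the raw slope multiplied by the ratio of the predictor's standard deviation to the outcome's standard deviation. The sketch below checks this on simulated data using base R only; the variables `x1`, `x2`, `y` and the helper `zscore` are assumptions made for the example, and `lm.beta()` itself is not called.

```r
# Simulated data, illustrative only
set.seed(1)
n  <- 200
x1 <- rnorm(n, mean = 50, sd = 18)      # plays the role of age
x2 <- sample(1:4, n, replace = TRUE)    # plays the role of interest in politics
y  <- 14 + 0.12 * x1 - 2.5 * x2 + rnorm(n, sd = 6)

fit <- lm(y ~ x1 + x2)

# Manual standardization: raw slope * sd(predictor) / sd(outcome)
b           <- coef(fit)[c("x1", "x2")]
beta_manual <- unname(b * c(sd(x1), sd(x2)) / sd(y))

# Same quantity from a regression on z-scored variables
zscore      <- function(v) (v - mean(v)) / sd(v)
fit_z       <- lm(zscore(y) ~ zscore(x1) + zscore(x2))
beta_zscore <- unname(coef(fit_z)[-1])

print(round(beta_manual, 3))
print(round(beta_zscore, 3))
```

Because both coefficients are expressed in standard deviation units, their absolute values can be compared directly, which is how the two effects are ranked in this section.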

##Visualizing the strength of the effects on “nwspol” (time spent on political news)

plot_model(model6, type = "std")

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.1. https://CRAN.R-project.org/package=stargazer

stargazer(model1, model4, model5, type = "text")

## 
## ================================================================
##                                       Dependent variable:       
##                                 --------------------------------
##                                   nwspol          nwspol        
##                                    (1)        (2)        (3)    
## ----------------------------------------------------------------
## polintr                         -2.687***                       
##                                  (0.156)                        
##                                                                 
## agea_centered                               0.129***            
##                                             (0.008)             
##                                                                 
## polintr_centered                                      -2.687*** 
##                                                        (0.156)  
##                                                                 
## Constant                        18.778***  11.905***  11.905*** 
##                                  (0.428)    (0.153)    (0.152)  
##                                                                 
## ----------------------------------------------------------------
## Observations                      2,064      2,064      2,064   
## R2                                0.125      0.110      0.125   
## Adjusted R2                       0.125      0.109      0.125   
## Residual Std. Error (df = 2062)   6.888      6.949      6.888   
## F Statistic (df = 1; 2062)      295.312*** 253.985*** 295.312***
## ================================================================
## Note:                                *p<0.1; **p<0.05; ***p<0.01

###Result of ANOVA

The results of our ANOVA tests and the R-squared values tell us that the full models, which take into account both age and the degree of political interest, explain the amount of time spent on political news much better than the reduced models. The higher R-squared also suggests that, in this analysis, a respondent’s age should be considered together with his/her level of political interest.

##Interaction model

We create a new model that features the respondent’s age as one of the predictors and lets gender act as a moderator. In other words, we want to check whether the relationship between X (respondent’s age) and Y (time spent on political news) depends on the level of Z (gender).

describe(ESSpr4$agea)

We see that the minimum age in the data is 1 year rather than 0, so an age of 0 is never observed. To keep the coefficients interpretable, we therefore use the mean-centered variable created in the previous part of our project.

model7 <- lm(ESSpr4$nwspol ~ ESSpr4$polintr_centered + ESSpr4$agea_centered + ESSpr4$gndr, data = ESSpr4)
model8 <- lm(ESSpr4$nwspol ~ ESSpr4$polintr_centered + ESSpr4$agea_centered * ESSpr4$gndr, data = ESSpr4)
anova(model7, model8)

The p-value is 0.02115, which is less than 0.05. This means that the interaction model fits better than the additive one, so it is the one we interpret.

tab_model(model7)

	ES Spr 4$nwspol
Predictors	Estimates	CI	p
(Intercept)	12.00	11.58 – 12.41	<0.001
ES Spr 4$polintr centered	-2.51	-2.81 – -2.22	<0.001
ES Spr 4$agea centered	0.12	0.11 – 0.14	<0.001
Female	-0.17	-0.74 – 0.40	0.558
Observations	2064
R² / adjusted R²	0.220 / 0.219

tab_model(model8)

	ES Spr 4$nwspol
Predictors	Estimates	CI	p
(Intercept)	11.97	11.55 – 12.38	<0.001
ES Spr 4$polintr centered	-2.52	-2.81 – -2.23	<0.001
ES Spr 4$agea centered	0.10	0.08 – 0.12	<0.001
Female	-0.16	-0.73 – 0.41	0.582
ESSpr4$agea_centered:ESSpr4$gndrFemale	0.04	0.01 – 0.07	0.021
Observations	2064
R² / adjusted R²	0.222 / 0.221

After examining the adjusted R-squared, we can conclude that our interaction model explains 22.1% of the variance in the time spent on political news per day, and the additive model explains 21.9%.

The intercept is 11.97. It is the predicted number of minutes spent on political news per day when all predictors are equal to 0: for the continuous variables this means their average, since they are mean-centered, and for the categorical variable it means the reference category (the coefficient is reported for “Female”). The coefficient for gender itself is not interpreted, because it is not significant (p = 0.582).
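The interaction term can be read as a difference in age slopes between the two gender groups. The sketch below derives the slope for each group from the rounded model8 estimates in the table above; it assumes that the reference category is the group other than “Female”, which is what the coefficient label suggests.

```r
# Rounded estimates copied from tab_model(model8)
b_age        <- 0.10   # slope of centered age in the reference category
b_age_female <- 0.04   # additional age slope for Female (interaction term)

# Simple slopes of age within each gender group
slopes <- c(reference = b_age, Female = b_age + b_age_female)
print(slopes)

# Predicted difference in minutes for a 10-year age difference within each group
print(slopes * 10)
```

Under this reading, time spent on political news rises with age somewhat faster among women than in the reference group.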

Marginal effects show the expected change in the outcome associated with 1-unit change of a particular predictor, taking into account all the other predictors. Here they are!

#References and sources

Final project - DAS is Fantastish

Mariya Pronyuk, Linara Belorukova, Alexander Vilkhovenko, Yuriy Kukartsev

25.05.2019

