Introduction

Hello, we are the 2BK team (Bakhareva, Borisenko, Kireeva, Kuzmicheva), and we are happy to present our project on linear regression modeling. We analyze how different factors influence satisfaction with democracy in Ireland (ESS round 8). As predictors we have chosen trust in parliament, voting in the last election, and trust in politicians. We explain why we chose these variables below.

As for the contribution:

Anastasia Bakhareva: boxplot construction, graph analysis, barplot construction, linear regression models
Iana Borisenko: linear regression assumptions check & analysis, equation construction
Irina Kireeva: correlation coefficients analysis, scatterplot construction
Daria Kuzmicheva: histogram construction, graph analysis, comparing models with ANOVA test

Analysis of Background

Since our topic is politics, we looked for articles that could guide our choice of variables. The articles we found make two points:

Within the set of liberal democracies, the Nordic countries tend to have the highest trust rates, and people's confidence in government is of a general nature: a high level of trust in one institution tends to spread to other institutions, such as trust in parliament and overall satisfaction with democracy. (Ireland is not a Nordic country, but it belongs to the same set of liberal democracies.)
Having a voting procedure results in higher trust in the chosen leader.

In our analysis, we selected variables that measure the level of trust in politicians and in parliament in Ireland, participation in elections, and the level of satisfaction with democracy. For these variables we build a linear regression model that predicts the value of the outcome variable from one or more predictor variables.

Our variables are:

library(dplyr)       # %>%
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Overview table of the variables used in the analysis
Label <- c("`trstprl`", "`vote`", "`stfdem`", "`trstplt`")
Meaning <- c("Trust in parliament", "Voted in the last election", "Satisfaction with democracy", "Trust in politicians")
Level_Of_Measurement <- c("Interval", "Nominal", "Interval", "Interval")
Measurement <- c("0 - 10", "Yes / No", "0 - 10", "0 - 10")
df <- data.frame(Label, Meaning, Level_Of_Measurement, Measurement, stringsAsFactors = FALSE)
kable(df) %>%
  kable_styling(bootstrap_options = c("bordered", "responsive", "striped"), full_width = FALSE)

Label	Meaning	Level_Of_Measurement	Measurement
`trstprl`	Trust in parliament	Interval	0 - 10
`vote`	Voted in the last election	Nominal	Yes / No
`stfdem`	Satisfaction with democracy	Interval	0 - 10
`trstplt`	Trust in politicians	Interval	0 - 10

Filtering the Data

library(readr)  # read_csv()
library(dplyr)  # select(), filter(), %>%

ESS1_8e01 <- read_csv("/students/apbakhareva_1/11.09/ESS1-8e01.csv")
es1 <- ESS1_8e01

# Keep only the four variables used in the analysis
es2 <- es1 %>%
  select(trstprl, vote, stfdem, trstplt)

# Drop the ESS missing-value codes on the 0-10 scales
# (77 = refusal, 88 = don't know, 99 = no answer)
es2 <- es2 %>%
  filter(trstprl != 77, trstprl != 88, trstprl != 99) %>%
  filter(stfdem != 77, stfdem != 88, stfdem != 99) %>%
  filter(trstplt != 77, trstplt != 88, trstplt != 99)

# For vote, drop 3 (not eligible to vote) and the missing-value codes 7, 8, 9
es2 <- es2 %>%
  filter(vote != 3, vote != 7, vote != 8, vote != 9)

# Treat vote as categorical: 1 = voted, 2 = did not vote
es2$vote <- as.factor(es2$vote)
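
An alternative way to express the same cleaning step is to recode the missing-value codes to NA first and then keep only complete rows. The sketch below is just an illustration on made-up toy data: the data frame toy and the helper to_na are hypothetical names, and base R is used instead of dplyr so that the example runs on its own.

```r
# Toy data standing in for the ESS extract (values are made up)
toy <- data.frame(
  trstprl = c(5, 7, 3, 88, 10),
  vote    = c(1, 2, 3, 1, 9),
  stfdem  = c(6, 4, 99, 7, 5),
  trstplt = c(4, 2, 1, 3, 77)
)

# Hypothetical helper: replace the listed missing-value codes with NA
to_na <- function(x, codes) replace(x, x %in% codes, NA)

# 0-10 scales use 77 / 88 / 99 as missing-value codes
scale_vars <- c("trstprl", "stfdem", "trstplt")
toy[scale_vars] <- lapply(toy[scale_vars], to_na, codes = c(77, 88, 99))

# vote uses 3 (not eligible) and 7 / 8 / 9 (missing)
toy$vote <- to_na(toy$vote, codes = c(3, 7, 8, 9))

# Keep complete rows only and give the factor readable labels
clean <- toy[complete.cases(toy), ]
clean$vote <- factor(clean$vote, levels = c(1, 2), labels = c("Yes", "No"))
print(clean)
```

Labelling the factor levels is optional; if it were done on the real data, the name of the vote coefficient in the model output would change accordingly.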

Exploring the data

First of all, we take a look at the basic characteristics of our dataset with the summary function.

summary(es2)

##     trstprl      vote         stfdem          trstplt      
##  Min.   : 0.00   1:1859   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.00   2: 523   1st Qu.: 4.000   1st Qu.: 2.000  
##  Median : 5.00            Median : 5.000   Median : 4.000  
##  Mean   : 4.46            Mean   : 5.352   Mean   : 3.742  
##  3rd Qu.: 6.00            3rd Qu.: 7.000   3rd Qu.: 5.000  
##  Max.   :10.00            Max.   :10.000   Max.   :10.000

The values look plausible: all three scales stay within 0 to 10, and vote has only two categories. Next we check for outliers, which is easiest to do graphically.

To understand the variables in our dataset graphically, we create:

Box plots, to spot outlying observations in the variables.
Density plots, to check whether the distributions of our variables are close to normal.
Scatter plots, to visualize the linear relationship between the variables.

Using Boxplot to check for outliers

We construct boxplots as follows:

par(mfrow = c(1, 3))
# boxplot.stats()$out returns the outlying values themselves (not row numbers);
# collapse them into one string so that each subtitle is a single label
out_label <- function(x) paste("Outlier values:", paste(unique(boxplot.stats(x)$out), collapse = ", "))
boxplot(es2$trstprl, main = "Trust in country's parliament", sub = out_label(es2$trstprl))
boxplot(es2$trstplt, main = "Trust in politicians", sub = out_label(es2$trstplt))
boxplot(es2$stfdem, main = "Satisfaction with democracy", sub = out_label(es2$stfdem))

The boxplots show medians of 4 to 5 on the 0 to 10 scale and an interquartile range of 3 points for each of the three variables. There are virtually no outliers: the only point flagged is in “trust in politicians”, and it is the value 10, the top of the scale. Note that boxplot.stats() reports outlying values, not row numbers, so this single point may stand for several respondents. Trust in politicians also has the lowest median of the three variables.
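
To see where the flagged point comes from: boxplot() marks every value that lies more than 1.5 times the interquartile range beyond the box. With the quartiles from summary(es2), trust in politicians has Q1 = 2 and Q3 = 5, so the upper fence is 5 + 1.5 × 3 = 9.5 and only the value 10 falls outside. For trust in parliament and satisfaction with democracy the upper fences are 10.5 and 11.5, so no value on a 0 to 10 scale can be flagged. (boxplot.stats() works with hinges, which can differ slightly from the quartiles printed by summary.) The sketch below uses a made-up vector x, not our data, to show how to obtain both the outlying values and their positions.

```r
# Made-up scores on a 0-10 scale (not the ESS data)
x <- c(rep(2, 5), rep(4, 5), rep(5, 5), 10, 10)

# Outlying values according to the 1.5 * IQR rule used by boxplot()
out_values <- boxplot.stats(x)$out

# Positions (row numbers) of those values in the original vector
out_rows <- which(x %in% out_values)

print(out_values)
print(out_rows)
```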

Using Histograms to check if continuous variables are close to normal

library(ggplot2)
library(cowplot)  # plot_grid()

# The first plot is called parl rather than par to avoid confusion with base R's par()
parl = ggplot(data = es2, aes(x = trstprl)) + geom_histogram(aes(y = after_stat(density)), position = "identity", alpha = 0.7, binwidth = 1, fill = "orange") + geom_density(col = "blue", fill = "white", alpha = 0.1) + xlab("Trust in parliament")
dem = ggplot(data = es2, aes(x = stfdem)) + geom_histogram(aes(y = after_stat(density)), position = "identity", alpha = 0.7, binwidth = 1, fill = "purple") + geom_density(col = "blue", fill = "white", alpha = 0.1) + xlab("Satisfaction with democracy")
polit = ggplot(data = es2, aes(x = trstplt)) + geom_histogram(aes(y = after_stat(density)), position = "identity", alpha = 0.7, binwidth = 1, fill = "grey") + geom_density(col = "blue", fill = "white", alpha = 0.1) + xlab("Trust in politicians")

plot_grid(parl, polit, dem)

As the histograms show, trust in parliament and satisfaction with democracy are roughly bell-shaped, while trust in politicians departs more clearly from a normal distribution. This does not prevent us from continuing, because linear regression assumes normality of the residuals, not of the variables themselves.

Using Barplots to check if categorical variable is representative

library(scales)
# after_stat() replaces the deprecated ..count.. notation
ggplot(data = es2, aes(x = vote)) + geom_bar(aes(y = after_stat(count / sum(count))), fill = "pink") + scale_y_continuous(labels = scales::percent) + ylab("relative frequencies") + ggtitle("Voting rates in Ireland")

The two groups are unequal in size, but both are large enough to analyze: 1859 respondents voted and 523 did not.
In other words, about 78% of Irish respondents took part in the last election.
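
The shares shown in the barplot can also be computed directly from the counts in the summary output. This is a small sketch; the names vote_counts and vote_share are ours and are not used elsewhere in the report.

```r
# Counts taken from summary(es2): 1 = voted, 2 = did not vote
vote_counts <- c(voted = 1859, did_not_vote = 523)

# Relative frequencies in percent, rounded to one decimal place
vote_share <- round(100 * prop.table(vote_counts), 1)
print(vote_share)
```

This gives 78.0% for voters and 22.0% for non-voters.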

Using Scatterplots to visualise the relationship

w = ggplot(data = es2, aes(x = trstprl, y = stfdem)) + geom_point() + geom_smooth(method = lm, fill="blue", color="blue", se = FALSE)  + ggtitle("Relationship between trust in parliament and satisfaction with democracy") + xlab("Trust in parliament") + ylab("Satisfaction with democracy")
w

e = ggplot(data = es2, aes(x = trstplt, y = stfdem)) + geom_point() + geom_smooth(method = lm, fill="blue", color="blue", se = FALSE)  + ggtitle("Relationship between trust in politicians and satisfaction with democracy") + xlab("Trust in politicians") + ylab("Satisfaction with democracy")
e

#plot_grid(w,e)

Our scatterplots show that:

there is a positive correlation between satisfaction with democracy and trust in parliament
there is a positive correlation between satisfaction with democracy and trust in politicians

Looking at correlation coefficients

The correlation coefficients are easiest to read from a plot:

es3 = es2 %>% 
  select( -vote)
q = cor(es3)
sjp.corr(es3, show.legend = TRUE)

All the relationships between our variables are moderately strong and positive.
Each correlation coefficient is close to 0.5.
Interestingly, the highest correlation is between trust in politicians and trust in parliament. These values agree with what we saw in the scatterplots.

Conducting Linear Regression Models

Since we have seen the linear relationship both in the scatterplots and in the correlation coefficients, it is time to fit the models.

Linear regression model with 1 predictor

model1 = lm( stfdem ~ trstprl, data = es2)
sjPlot::tab_model(model1)

	stfdem
Predictors	Estimates	CI	p
(Intercept)	3.19	3.02 – 3.35	<0.001
trstprl	0.49	0.45 – 0.52	<0.001
Observations	2382
R² / adjusted R²	0.261 / 0.261
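
As a quick consistency check (a sketch based on the rounded value in the table): in a model with an intercept and a single predictor, R² is the square of the correlation between the predictor and the outcome, and the sign of the correlation is the sign of the slope. So the table implies

\[ R^2 = r^2 \quad\Rightarrow\quad r_{stfdem,\,trstprl} = +\sqrt{0.261} \approx 0.51 \]

which agrees with the statement above that the correlation coefficients are close to 0.5.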

Linear regression model with 2 predictors

model2 = lm( stfdem ~ trstprl + trstplt , data = es2)
sjPlot::tab_model(model2)

	stfdem
Predictors	Estimates	CI	p
(Intercept)	3.02	2.86 – 3.19	<0.001
trstprl	0.37	0.33 – 0.41	<0.001
trstplt	0.18	0.14 – 0.22	<0.001
Observations	2382
R² / adjusted R²	0.285 / 0.284

Linear regression model with 3 predictors

model3 = lm( stfdem ~ trstprl + trstplt + vote , data = es2)
sjPlot::tab_model(model3)

	stfdem
Predictors	Estimates	CI	p
(Intercept)	2.99	2.82 – 3.16	<0.001
trstprl	0.37	0.33 – 0.41	<0.001
trstplt	0.18	0.14 – 0.22	<0.001
vote 2	0.13	-0.05 – 0.31	0.167
Observations	2382
R² / adjusted R²	0.285 / 0.284
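
The table already hints at why vote does not help. As a rough sketch, assuming the reported interval is a 95% interval built from a normal approximation, the standard error and the t-statistic of the vote coefficient can be recovered from the rounded numbers:

\[ SE \approx \frac{0.31 - (-0.05)}{2 \cdot 1.96} \approx 0.092, \qquad t \approx \frac{0.13}{0.092} \approx 1.4 < 1.96 \]

The estimate is less than two standard errors away from zero, which matches the confidence interval containing zero and the p-value above 0.05. Because the inputs are rounded, the result is only approximate.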

Comparing Models

ANOVA lets us compare nested models, that is, models that are identical except that one of them contains additional predictors that the other does not.

anova(model1, model2)

As we can see here, the p-value is far below 0.05, so adding trust in politicians significantly reduces the residual sum of squares (RSS). The model with the smaller RSS fits the data better.
Thus, in this case, the model with 2 predictors is better.
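
The size of this improvement can be approximated from the R² values in the tables above. The following is a sketch with rounded inputs, so the result is approximate; q is the number of added predictors, p the number of predictors in the larger model, and n the number of observations:

\[ F = \frac{(R^2_{2} - R^2_{1}) / q}{(1 - R^2_{2}) / (n - p - 1)} = \frac{(0.285 - 0.261) / 1}{(1 - 0.285) / (2382 - 2 - 1)} \approx \frac{0.024}{0.000301} \approx 80 \]

An F value of about 80 on 1 and 2379 degrees of freedom is far above the 5% critical value of roughly 3.84, which is why the p-value is so small.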

anova(model2, model3)

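Here the picture is different: the p-value is above 0.05, which means that adding vote does not significantly improve the fit. Since only one coefficient is added, this test is equivalent to the t-test of the vote coefficient in model 3 (p = 0.167).
By the principle of parsimony, the simpler model 2 would normally be preferred. We nevertheless keep the third model, because the effect of voting is of substantive interest to us.
We can conclude that whether a person voted or not is not significantly related to his or her satisfaction with democracy. We keep the variable anyway.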
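
The equivalence between the ANOVA comparison and the test of a single added coefficient can be checked on simulated data. The sketch below does not use the ESS sample: the variables x1, x2, g and y are made up, and only base R functions are used.

```r
set.seed(1)  # simulated data, not the ESS sample
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
g  <- factor(sample(c(1, 2), n, replace = TRUE))  # two-level factor, like vote
y  <- 3 + 0.4 * x1 + 0.2 * x2 + rnorm(n)          # g has no true effect

small <- lm(y ~ x1 + x2)      # nested model
big   <- lm(y ~ x1 + x2 + g)  # same model plus one extra term

# p-value of the F-test that compares the two nested models
p_anova <- anova(small, big)$`Pr(>F)`[2]

# p-value of the t-test for the added coefficient in the larger model
p_coef <- summary(big)$coefficients["g2", "Pr(>|t|)"]

print(c(anova = p_anova, coefficient = p_coef))
```

Both p-values coincide, because with one added coefficient the F statistic is the square of the t statistic.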

Checking Linear Regression Assumptions

Linear regression makes several assumptions about the data:

Linearity of the data
Normality of residuals
Homogeneity of residual variance (homoscedasticity)
Independence of the residual error terms

par(mfrow = c(2, 2))
plot(model3)

Linearity: the Residuals vs Fitted plot shows a roughly horizontal line without distinct patterns, which supports a linear relationship.
Normality: in the Q-Q plot the points follow the straight dashed line, which indicates approximately normally distributed residuals.
Homoscedasticity: the Scale-Location plot shows a roughly horizontal red line with evenly spread points. The points form diagonal bands because the outcome takes only the integer values 0 to 10, not because the variance changes.
Influential observations: the Residuals vs Leverage plot shows only a couple of outlying points.
Independence of the residuals cannot be judged from these four plots, and we do not test it here.
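
The striped look of the diagnostic plots follows from the discrete outcome. A residual is the observed value minus the fitted value, so all respondents who gave the same answer lie on one straight line with slope -1 in the Residuals vs Fitted plot. The sketch below reproduces this on simulated data (x, y and fit are made-up names, not our model).

```r
set.seed(42)  # simulated data, not the ESS sample
x <- sample(0:10, 300, replace = TRUE)

# Integer outcome restricted to 0..10, like the survey scales
y <- pmin(10, pmax(0, round(3 + 0.4 * x + rnorm(300, sd = 2))))
fit <- lm(y ~ x)

# residual + fitted gives back the observed value, so for a fixed
# observed value the residuals fall on a line with slope -1
band_check <- all.equal(unname(resid(fit) + fitted(fit)), y)

# There is one band per distinct outcome value
n_bands <- length(unique(y))

# Colour the points by observed value to make the bands visible
plot(fitted(fit), resid(fit), col = y + 1,
     xlab = "Fitted values", ylab = "Residuals")
```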

Conclusion

Based on our analysis, after having modeled a mathematical function and checked its assumptions, we can make the following conclusions:

Trust in politicians and trust in parliament are closely related to each other. Together they are the main elements of our model, since both have a significant effect on satisfaction with democracy.
The same cannot be said about the variable vote: whether or not a person took part in the election does not play a significant role in our model.
Having checked the assumptions, we conclude that they hold reasonably well. The model explains about 28% of the variance in satisfaction with democracy (adjusted R² = 0.284).

The final formula is:

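\[ \mathrm{stfdem} = 2.99 + 0.37 \cdot \mathrm{trstprl} + 0.18 \cdot \mathrm{trstplt} + 0.13 \cdot \mathrm{vote}_{2} + e \]

Here vote 2 equals 1 for respondents who did not vote and 0 for those who did. With these variables one can make a rough prediction of satisfaction with democracy in Ireland.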
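
To illustrate how the formula is used, the sketch below plugs the rounded coefficients of model 3 into a small function. The function predict_stfdem and the example respondent are hypothetical; with the fitted model object at hand one would call predict(model3, newdata = ...) instead.

```r
# Point prediction from the rounded coefficients of model 3.
# did_not_vote is 1 for non-voters (vote = 2) and 0 for voters (vote = 1).
predict_stfdem <- function(trstprl, trstplt, did_not_vote) {
  2.99 + 0.37 * trstprl + 0.18 * trstplt + 0.13 * did_not_vote
}

# Example respondent: trust in parliament 5, trust in politicians 4
voter     <- predict_stfdem(trstprl = 5, trstplt = 4, did_not_vote = 0)
non_voter <- predict_stfdem(trstprl = 5, trstplt = 4, did_not_vote = 1)
print(c(voter = voter, non_voter = non_voter))
```

For this respondent the predicted satisfaction is 2.99 + 1.85 + 0.72 = 5.56 for a voter and 5.69 for a non-voter, a difference that is not statistically significant.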

LR

2BK

30.04.2019