Introduction

Hello, we are the 2BK team (Bakhareva, Borisenko, Kireeva, Kuzmicheva), and we are happy to present our project on linear regression modeling. We are going to analyze how different factors influence satisfaction with democracy in Ireland (round 8). As predictors we have chosen the following variables: trust in parliament, voting in the last election, and trust in politicians. We explain why we chose exactly these below.

Contributions:

  • Anastasia Bakhareva: boxplot construction, graph analysis, barplot construction, linear regression models
  • Iana Borisenko: linear regression assumptions check & analysis, equation construction
  • Irina Kireeva: correlation coefficients analysis, scatterplot construction
  • Daria Kuzmicheva: histogram construction, graph analysis, comparing models with ANOVA test

Analysis of Background

Since our topic is politics, we looked for articles that could guide our choice of variables. They told us the following:

  • Within the set of liberal democracies, the Nordic countries tend to have the highest trust rates. Ireland is not a Nordic country, but it is also a European liberal democracy, so the finding is relevant to us. Moreover, people's confidence in the government is of a general nature: a high level of trust in one institution tends to spread to other institutions, so trust in parliament goes together with overall satisfaction with democracy.
  • The presence of a voting procedure results in higher trust in the chosen leader.

In our analysis, we selected variables that describe the level of trust in politicians and in parliament in Ireland, as well as participation in elections and the level of satisfaction with democracy. For these variables we will build a mathematical model that helps us predict the value of the outcome variable from one or more predictor variables.

Our variables are:

Label    Meaning                       Level of measurement  Scale
trstprl  Trust in parliament           Interval              0 - 10
vote     Voted in the last election    Nominal               Yes / No
stfdem   Satisfaction with democracy   Interval              0 - 10
trstplt  Trust in politicians          Interval              0 - 10

Exploring the data

First of all, we take a look at the summary statistics of our dataset with the function summary.

##     trstprl      vote         stfdem          trstplt      
##  Min.   : 0.00   1:1859   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.00   2: 523   1st Qu.: 4.000   1st Qu.: 2.000  
##  Median : 5.00            Median : 5.000   Median : 4.000  
##  Mean   : 4.46            Mean   : 5.352   Mean   : 3.742  
##  3rd Qu.: 6.00            3rd Qu.: 7.000   3rd Qu.: 5.000  
##  Max.   :10.00            Max.   :10.000   Max.   :10.000

The values look plausible: every trust and satisfaction score lies within the 0 - 10 range. Now it is time to check for outliers, which we can do with graphs.

We also need to understand the variables in our dataset graphically. For that we will create:

  • A box plot, to spot outlying observations in the variables.
  • A density plot, to check whether the distribution of each variable is close to normal.
  • A scatter plot, to visualize the linear relationship between the variables.

Using Boxplot to check for outliers

We construct boxplots as follows:

From the boxplots we can see that the distributions of the interval variables are roughly symmetric. There are virtually no outliers, except for one point in “trust in politicians” (it can be found on line 10 of our dataset). Moreover, trust in politicians has the lowest median of the three variables.
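To make explicit what counts as an outlier on a boxplot, here is a minimal Python sketch of the usual 1.5 × IQR rule. The function name `boxplot_outliers` and the scores are made up for illustration and are not taken from our dataset. The quartile method used here can also differ slightly from the one used by the plotting function, so borderline points may be classified differently.

```python
from statistics import quantiles

def boxplot_outliers(values):
    """Return the values lying more than 1.5 * IQR beyond the quartiles."""
    # First and third quartiles (the middle cut point is the median).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical 0-10 trust scores, NOT the survey data.
scores = [2, 3, 3, 4, 4, 4, 5, 5, 5, 10]
print(boxplot_outliers(scores))
```

With these toy scores only the single value 10 falls outside the fences, which mirrors the situation we see for “trust in politicians”.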

Using Barplots to check if categorical variable is representative

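  • The two groups differ in size (1859 respondents voted, 523 did not), but both are large enough to be included in the analysis.
  • As can be seen, about 78% of the Irish respondents took part in the last election.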

Looking at correlation coefficients

We look at them in the following visualisation:

  • All the relationships between our variables are moderate in strength and positive in direction.
  • Each of the correlation coefficients is close to 0.5.
  • Interestingly, the highest correlation coefficient is between trust in politicians and trust in parliament. The presented values agree with what we see on the scatterplots.
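For readers who want to see what stands behind a correlation coefficient, the sketch below computes Pearson's r by hand in Python. The helper `pearson_r` and the two short lists are hypothetical and serve only as an illustration; they are not our survey data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical 0-10 scores for five respondents, NOT the survey data.
trust_parliament = [2, 4, 5, 6, 8]
trust_politicians = [1, 3, 5, 5, 7]
print(round(pearson_r(trust_parliament, trust_politicians), 3))
```

A value near 1 means the two scores rise together almost perfectly; our real coefficients of about 0.5 indicate a clearly positive but much looser relationship.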

Building Linear Regression Models

Since we have seen the linear relationship both graphically in the scatter plots and numerically in the correlation coefficients, it is time to build the models.

Linear regression model with 1 predictor

Dependent variable: stfdem

Predictors    Estimates  CI           p
(Intercept)   3.19       3.02 – 3.35  <0.001
trstprl       0.49       0.45 – 0.52  <0.001

Observations       2382
R2 / adjusted R2   0.261 / 0.261
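The intercept, slope and R2 of a model with one predictor come from simple closed-form formulas. The Python sketch below shows them under the assumption of ordinary least squares; `fit_simple_ols` is a hypothetical helper and the data are toy numbers, not the survey responses.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, r2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R2 = 1 - residual sum of squares / total sum of squares.
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    tss = sum((b - mean_y) ** 2 for b in y)
    return intercept, slope, 1 - rss / tss

# Toy data, NOT the survey data.
x = [0, 2, 4, 6, 8, 10]
y = [3, 5, 4, 7, 6, 9]
intercept, slope, r2 = fit_simple_ols(x, y)
print(round(intercept, 2), round(slope, 2), round(r2, 3))
```

Read the same way, our table says that each additional point of trust in parliament is associated with a 0.49 point increase in satisfaction with democracy, starting from 3.19 at zero trust.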

Linear regression model with 2 predictors

Dependent variable: stfdem

Predictors    Estimates  CI           p
(Intercept)   3.02       2.86 – 3.19  <0.001
trstprl       0.37       0.33 – 0.41  <0.001
trstplt       0.18       0.14 – 0.22  <0.001

Observations       2382
R2 / adjusted R2   0.285 / 0.284

Linear regression model with 3 predictors

Dependent variable: stfdem

Predictors    Estimates  CI            p
(Intercept)   2.99       2.82 – 3.16   <0.001
trstprl       0.37       0.33 – 0.41   <0.001
trstplt       0.18       0.14 – 0.22   <0.001
vote 2        0.13       -0.05 – 0.31  0.167

Observations       2382
R2 / adjusted R2   0.285 / 0.284
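The adjusted R2 in the tables penalises R2 for the number of predictors. As a quick check, the Python sketch below recomputes it from the rounded R2 values and the 2382 observations reported above; `adjusted_r2` is a hypothetical helper name, and small differences from software output are possible because the inputs are rounded.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

N_OBS = 2382  # observations reported in the regression tables

# (rounded R2 from the tables, number of predictors)
for r2, p in [(0.261, 1), (0.285, 2), (0.285, 3)]:
    print(p, round(adjusted_r2(r2, N_OBS, p), 3))
```

With a sample this large the penalty is tiny, which is why R2 and adjusted R2 are almost identical in all three models.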

Comparing Models

ANOVA lets us compare nested models, that is, models that are identical except that one of them includes additional variables that the other does not.

  • In the first comparison (model 1 against model 2), the p-value is much less than 0.05. This means that adding trust in politicians reduces the residual sum of squares (RSS) significantly, so the model with the lower RSS fits better.
  • Thus, in this case, the model with 2 predictors is better.
  • In the second comparison (model 2 against model 3), the situation is different: the p-value is noticeably bigger than 0.05, which means that adding vote does not significantly improve the fit. The two models describe the data equally well.
  • By parsimony the second model would be enough. However, we prefer the third model, since we are interested in the effect of voting.
  • As we can conclude, whether a person voted or not hardly affects his or her satisfaction with democracy. Nevertheless, we keep the variable in the model.
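The test statistic behind the ANOVA comparison of nested models can be sketched in Python. Because all our models share the same outcome and observations, the F statistic can be written with R2 instead of RSS. The function `nested_f_statistic` is a hypothetical helper, and since it is fed the rounded R2 values from the tables, the result is only approximate.

```python
def nested_f_statistic(r2_small, r2_large, n_obs, p_large, n_added):
    """F statistic for adding n_added predictors to a nested linear model."""
    df_resid = n_obs - p_large - 1  # residual degrees of freedom of the larger model
    return ((r2_large - r2_small) / n_added) / ((1 - r2_large) / df_resid)

# Model 1 (trstprl) against model 2 (trstprl + trstplt), rounded R2 from the tables.
f_12 = nested_f_statistic(0.261, 0.285, n_obs=2382, p_large=2, n_added=1)
print(round(f_12, 1))
```

The value is roughly 80, far above the 5% critical value of about 3.84 for one added predictor in a large sample, which agrees with the very small p-value in the first comparison. For model 2 against model 3 the rounded R2 values are identical, so this shortcut cannot reproduce the second test; there the ANOVA output itself has to be used.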

Checking Linear Regression Assumptions

Linear regression makes several assumptions about the data, such as:

  • Linearity of the data
  • Normality of residuals
  • Homogeneity of residual variance
  • Independence of residual error terms

  • Linearity: on the Residuals vs. Fitted plot we see a roughly horizontal line without distinct patterns, which indicates that the relationship is approximately linear.
  • Normality: on the Q-Q plot the points follow the straight dashed line, which indicates that the residuals are approximately normally distributed.
  • Homogeneity of variance: the Scale-Location plot shows a roughly horizontal red line with the points spread evenly around it, although they form an unusual pattern. This is consistent with homoscedasticity.
  • Influential observations: on the Residuals vs. Leverage plot we can spot only a couple of outliers.

Conclusion

Based on our analysis, after building the model and checking its assumptions, we can draw the following conclusions:

  • Trust in politicians is closely related to trust in parliament. Together they are the main elements of our model, since they have the most significant effect on satisfaction with democracy.
  • The same cannot be said about the variable vote. Whether a person took part in the election or not does not play a significant role in our model.
  • Having checked the assumptions, we conclude that they hold. The model describes the data reasonably well and explains about 28.5% of the variance in satisfaction with democracy.

The final formula is:

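\[ \text{stfdem} = 2.99 + 0.37 \cdot \text{trstprl} + 0.18 \cdot \text{trstplt} + 0.13 \cdot \text{vote2} + e \]

Here vote2 equals 1 for respondents in category 2 of vote and 0 otherwise, and e is the error term. We can say that these variables allow one to predict, to some extent, satisfaction with democracy in Ireland.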
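To show how the final formula is used, here is a small Python sketch that plugs values into it. The function `predict_stfdem` is a hypothetical helper; the coefficients are the rounded estimates from the table of the model with 3 predictors, and we assume that category 1 of vote means "voted" and category 2 means "did not vote".

```python
def predict_stfdem(trstprl, trstplt, vote):
    """Predicted satisfaction with democracy (0-10 scale) from the fitted model."""
    # Dummy variable: 1 for category 2 of vote (assumed "did not vote"), else 0.
    vote2 = 1 if vote == 2 else 0
    return 2.99 + 0.37 * trstprl + 0.18 * trstplt + 0.13 * vote2

# Hypothetical respondent: trust in parliament 5, trust in politicians 4, voted.
print(round(predict_stfdem(5, 4, 1), 2))
# Same respondent, but in category 2 of vote.
print(round(predict_stfdem(5, 4, 2), 2))
```

The two predictions differ only by the vote coefficient of 0.13, which shows how small the estimated effect of voting is compared to the two trust variables.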