knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      out.width = "100%", out.height = "100%")
library(wbstats) #to load data from World Bank
library(DT) #to display interactive tables
library(ggplot2) #to visualize the data
library(dplyr) #to transform the data
library(readr) #to read csv
library(corrplot) #to build correlation plot
library(countrycode) #to match country name with country code
library(ggthemes) #theme for graphics
library(visdat) #to visualize NA
library(ggrepel) #to visualize country names
library(plotly) #to create interactive graphs
library(olsrr) # to check linear regression assumptions
library(paletteer) #color palette
library(stargazer) #to format regression tables
#set working directory
setwd("C:/upwork/Monica/milestone4")

Data preparation

EDA

Correlation matrix

The figure below depicts the correlation matrix. Positive correlation coefficients are shown in blue and negative ones in red. The corruption perception index has a relatively strong negative correlation with freedom, rule of law, control of corruption, government effectiveness and regulatory quality, and correlates only weakly with the other variables. Rule of law, control of corruption, government effectiveness and regulatory quality correlate strongly with each other. This matters because it can cause multicollinearity in the predictive model. We use this information to build a regression model that predicts the corruption perception index. Note, however, that correlation does not imply causation.
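As a minimal sketch of how strongly correlated pairs could be listed, the snippet below builds a correlation matrix with base R's `cor()` and reports the pairs whose absolute correlation exceeds 0.9. The data frame `indicators`, its column names and its simulated values are hypothetical stand-ins, not the report's data.

```r
# Hypothetical stand-in data: three indicators driven by one common factor
# (so they correlate strongly) plus one unrelated variable.
set.seed(1)
n <- 200
common <- rnorm(n)
indicators <- data.frame(
  rule_of_law = common + rnorm(n, sd = 0.2),
  ctr_corrupt = common + rnorm(n, sd = 0.2),
  gov_effect  = common + rnorm(n, sd = 0.2),
  gdp_growth  = rnorm(n)
)

# Pairwise Pearson correlations
cor_mat <- cor(indicators, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| > 0.9
idx <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 3)
)
print(high_pairs)
```

Pairs flagged this way are candidates for exclusion from the same regression, which is the reasoning applied later when the explanatory variables are chosen.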

Top 20 countries with the highest corruption perception index

The graph below shows the 20 countries with the highest corruption perception index in 2016. Moldova ranks first, Bosnia and Herzegovina second and Romania third. The top 20 mostly consists of developed countries.

Top 20 countries with the lowest corruption perception index

The 20 countries with the lowest corruption perception index in 2016 are predominantly developed countries, with exceptions such as Rwanda and Somalia. Singapore ranks first, Rwanda second and Denmark third.

Top 10 countries with the lowest Happiness score

This graph is interactive. You can put your cursor on the graph to find out more information.

Top 10 countries with the highest Happiness score

This graph is interactive. You can put your cursor on the graph to find out more information.

Corruption Perception Index and GDP

The graph below shows the histogram of the corruption perception index. The median value of this index is 0.811.

This graph shows how GDP per capita ($) and the corruption perception index are related. The red line is the median of the corruption perception index and the blue line is the median of GDP per capita. There is no linear relationship between the two. All Western and Northern European countries (except Lithuania and Latvia), Australia and New Zealand have a corruption perception index below the median. Eastern European countries have lower GDP per capita than Western Europe and, except for Belarus, a corruption perception index above the median. Sub-Saharan Africa, Latin America and Southern Europe have a corruption perception index at or above the median.
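The reading of the plot relative to its two median lines can be sketched in a few lines of base R. The five countries and their values below are hypothetical, chosen only to illustrate the classification.

```r
# Hypothetical values for five countries (not the report's data)
countries <- data.frame(
  country    = c("A", "B", "C", "D", "E"),
  gdp_pc     = c(1200, 4500, 9800, 41000, 56000),
  corruption = c(0.85, 0.90, 0.81, 0.40, 0.25)
)

med_gdp  <- median(countries$gdp_pc)      # plays the role of the blue line
med_corr <- median(countries$corruption)  # plays the role of the red line

# Position of each country relative to the two median lines
countries$gdp_side  <- ifelse(countries$gdp_pc > med_gdp, "above", "at or below")
countries$corr_side <- ifelse(countries$corruption > med_corr, "above", "at or below")

print(table(gdp = countries$gdp_side, corruption = countries$corr_side))
```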

Corruption Perception Index and Freedom

The graph below shows the histogram of freedom. The median value of this index is 0.774.

The graph below shows the corruption perception index and freedom by sub-region. The red line is the median of the corruption perception index and the blue line is the median of freedom. There is no linear relationship between freedom and the corruption perception index, and the countries do not separate into clusters by region or sub-region. Most countries lie around the median of the corruption perception index regardless of their freedom value. The developed European and North American countries have higher freedom and a lower corruption perception index than other countries.

The graph below shows the corruption perception index and freedom by economic status. Two clusters are visible: developed countries (light blue) and developing countries (light pink). In developed countries the correlation between freedom and corruption is moderately negative (-0.504). In developing countries it is weaker (-0.322).

Here the relationship between the corruption perception index and freedom by economic status is shown more clearly. Median values of freedom (blue) and the corruption perception index (red) are calculated for each development group.
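Group-wise correlations like the two quoted above can be computed by splitting the data on the grouping column. The sketch below assumes a data frame with columns `economic_status`, `freedom` and `corruption`; the values are simulated for illustration and do not reproduce the report's figures.

```r
# Hypothetical stand-in data with a grouping column
set.seed(2)
n <- 120
df <- data.frame(
  economic_status = rep(c("Developed", "Developing"), each = n / 2),
  freedom         = runif(n, 0.3, 1)
)
# Simulated outcome: negatively related to freedom, plus noise
df$corruption <- 1 - 0.4 * df$freedom + rnorm(n, sd = 0.1)

# Pearson correlation computed separately within each group
cor_by_group <- sapply(
  split(df, df$economic_status),
  function(d) cor(d$freedom, d$corruption)
)
print(round(cor_by_group, 3))
```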

Corruption Perception Index and Control of Corruption

The graph below shows the histogram of control of corruption. The median value of this index is -0.374.

This graph depicts the corruption perception index and control of corruption by sub-region. The red line shows the median of the corruption perception index and the blue line the median of control of corruption. Most sub-regions lie near the medians; the exceptions are Western Europe, Northern America, Australia and New Zealand, which have some of the lowest corruption perception values and some of the highest control of corruption. Within South-Eastern Asia, countries differ widely: Singapore has the lowest corruption perception index and one of the highest values of control of corruption, whereas Vietnam, the Philippines and Thailand are at the median of the corruption perception index. Sub-Saharan countries such as Somalia and Rwanda are most probably outliers. The graphs of the corruption perception index against rule of law, government effectiveness and regulatory quality share the same pattern as the graph against control of corruption.

The graph below shows the corruption perception index and control of corruption by economic status. Two clusters are visible: developed countries (light blue) and developing countries (light pink). In developed countries the correlation between control of corruption and corruption is strongly negative (-0.731). In developing countries it is weak (-0.268).

Here the relationship between the corruption perception index and control of corruption by economic status is shown more clearly. Median values of control of corruption (blue) and the corruption perception index (red) are calculated for each development group.

Corruption Perception Index and Rule of Law

The graph below shows the histogram of rule of law. The median value of this index is -0.264.

Corruption Perception Index and Government Effectiveness

The graph below shows the histogram of government effectiveness. The median value of this index is -0.161.

Corruption Perception Index and Regulatory Quality

The graph below shows the histogram of regulatory quality. The median value of this index is -0.107.

Corruption perception index and happiness score depending on sub-region

The graph below shows the corruption perception index and happiness score by sub-region. There is no linear relationship between the two. Several clusters of countries are visible: Sub-Saharan Africa (except Mauritius), Latin America and the Caribbean (except Haiti and Venezuela), and Western Europe together with North America and ANZ. Sub-Saharan African countries have a happiness score below the median, while most Latin American and Caribbean countries have a happiness score above the median. Both clusters have a corruption perception index of 0.67 to 0.9. Western Europe (Northern countries), North America and ANZ have the lowest corruption perception index (below 0.52) and the highest happiness score (above 6).

Regression Models

Let's build a linear regression model to explore which factors influence the corruption perception index. Rule of law, government effectiveness and regulatory quality are not used as explanatory variables, to avoid multicollinearity: the World Bank governance indicators correlate very strongly with each other (above 0.9).
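A minimal sketch of such a model in base R is shown below. The formula mirrors the terms that appear in the regression table (`freedom`, `wbo.ctr_corrupt`, its square, and `economic_status`), but the data frame `dat` is simulated here, so the estimates will not match the reported ones.

```r
# Simulated stand-in data; column names follow the regression table,
# the values and the data-generating coefficients are hypothetical.
set.seed(3)
n <- 142
dat <- data.frame(
  freedom         = runif(n, 0.3, 1),
  wbo.ctr_corrupt = rnorm(n),
  economic_status = factor(sample(c("Developed", "Developing"), n, replace = TRUE))
)
dat$corruption <- 1 - 0.2 * dat$freedom - 0.07 * dat$wbo.ctr_corrupt -
  0.07 * dat$wbo.ctr_corrupt^2 + rnorm(n, sd = 0.1)

# I() protects the square so that ^ is treated as arithmetic inside the formula
fit <- lm(
  corruption ~ freedom + wbo.ctr_corrupt + I(wbo.ctr_corrupt^2) + economic_status,
  data = dat
)

print(summary(fit)$coefficients)
print(summary(fit)$adj.r.squared)
```

Because `economic_status` is a factor, `lm()` encodes it as a dummy variable, which is why the table reports a row named `economic_statusDeveloping`.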

Interpretation

Ceteris paribus:

b0 - intercept

b1 = wbo.ctr_corrupt - regression coefficient for the linear effect of X on Y

b2 = I(wbo.ctr_corrupt^2) - regression coefficient for the quadratic effect of X on Y

The value of -0.065 is the linear coefficient and the value of -0.072 is the quadratic coefficient, which captures the curvature in the data. Because the quadratic term is negative, the fitted curve is concave: the marginal effect of control of corruption, -0.065 - 0.144 * (control of corruption), becomes more negative as control of corruption increases. The adjusted R² of this model is 0.656.

Dependent variable: corruption

freedom                     -0.235***
                            (0.080)
wbo.ctr_corrupt             -0.065***
                            (0.013)
I(wbo.ctr_corrupt^2)        -0.072***
                            (0.008)
economic_statusDeveloping   -0.065***
                            (0.023)
Constant                    1.025***
                            (0.062)
Observations                142
R²                          0.665
Adjusted R²                 0.656
Residual Std. Error         0.107 (df = 137)
F Statistic                 68.078*** (df = 4; 137)
Note: *p<0.1; **p<0.05; ***p<0.01

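According to the Kolmogorov-Smirnov test (p-value 0.151), the stochastic disturbance (error) term is normally distributed. The Shapiro-Wilk, Cramer-von Mises and Anderson-Darling tests, however, reject normality.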

## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9402         0.0000 
## Kolmogorov-Smirnov        0.0954         0.1510 
## Cramer-von Mises         38.9011         0.0000 
## Anderson-Darling          1.7517          2e-04 
## -----------------------------------------------
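Two of the four tests in the table above are available in base R and can be applied to the residuals of any fitted model. The sketch below uses simulated data and a hypothetical model `fit`; the layout of the table above resembles the output of `ols_test_normality()` from the olsrr package, which the report loads, but that is an assumption.

```r
# Simulated data with normally distributed errors (hypothetical)
set.seed(4)
x <- rnorm(150)
y <- 1 + 0.5 * x + rnorm(150, sd = 0.3)
fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test: H0 = residuals are normally distributed
sw <- shapiro.test(res)

# Kolmogorov-Smirnov test against a normal distribution whose mean and sd
# are estimated from the residuals (the p-value is then only approximate)
ks <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))

print(c(shapiro_p = sw$p.value, ks_p = ks$p.value))
```

A p-value above the chosen significance level means the test does not reject normality. Different tests can disagree, as they do in the table above.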

In the EDA part we saw two clusters: developed and developing countries. Let's build separate models for each group; the results are in the table below. Models (1) to (3) are fitted on the 95 developed countries and models (4) to (6) on the 47 developing countries. Model (2) is the best choice here: it uses only control of corruption and its squared term as explanatory variables, and dropping the statistically insignificant freedom term of model (1) barely changes the adjusted R² (0.736 versus 0.738). This model could be used to predict the corruption perception index for developed countries.

b0 - intercept

b1 = wbo.ctr_corrupt - regression coefficient for the linear effect of X on Y

b2 = I(wbo.ctr_corrupt^2) - regression coefficient for the quadratic effect of X on Y

The value of -0.048 is the linear coefficient and the value of -0.092 is the quadratic coefficient, which captures the curvature in the data. Because the quadratic term is negative, the fitted curve is concave: the marginal effect of control of corruption, -0.048 - 0.184 * (control of corruption), becomes more negative as control of corruption increases. The adjusted R² of this model is 0.736.

Models fitted on developing countries have a lower R² (0.187 to 0.606).

Dependent variable: corruption

                      (1)           (2)           (3)           (4)           (5)           (6)
wbo.ctr_corrupt       -0.045***     -0.048***                   -0.392***     -0.411***
                      (0.015)       (0.015)                     (0.055)       (0.061)
I(wbo.ctr_corrupt^2)  -0.088***     -0.092***     -0.117***     -0.247***     -0.234***
                      (0.011)       (0.011)       (0.008)       (0.034)       (0.037)
freedom               -0.128                                    -0.329***                   1.594*
                      (0.102)                                   (0.102)                     (0.901)
I(freedom^2)                                                                                -1.437**
                                                                                            (0.674)
Constant              0.955***      0.860***      0.874***      0.921***      0.655***      0.401
                      (0.076)       (0.014)       (0.014)       (0.086)       (0.027)       (0.297)
Observations          95            95            95            47            47            47
R²                    0.746         0.742         0.713         0.606         0.511         0.187
Adjusted R²           0.738         0.736         0.710         0.578         0.489         0.150
Residual Std. Error   0.104         0.105         0.110         0.082         0.090         0.116
                      (df = 91)     (df = 92)     (df = 93)     (df = 43)     (df = 44)     (df = 44)
F Statistic           89.136***     132.074***    231.021***    22.016***     22.975***     5.074**
                      (df = 3; 91)  (df = 2; 92)  (df = 1; 93)  (df = 3; 43)  (df = 2; 44)  (df = 2; 44)
Note: *p<0.1; **p<0.05; ***p<0.01
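Fitting the same specification separately for each group, as in the table above, can be sketched with `split()` and `lapply()`. The group sizes (95 and 47) follow the table, while the data frame `dat` and all its values are simulated for illustration.

```r
# Simulated stand-in data; only the group sizes follow the report
set.seed(5)
n <- 142
dat <- data.frame(
  economic_status = rep(c("Developed", "Developing"), times = c(95, 47)),
  wbo.ctr_corrupt = rnorm(n)
)
dat$corruption <- 0.86 - 0.05 * dat$wbo.ctr_corrupt -
  0.09 * dat$wbo.ctr_corrupt^2 + rnorm(n, sd = 0.1)

# Fit the same quadratic specification separately for each group
fits <- lapply(
  split(dat, dat$economic_status),
  function(d) lm(corruption ~ wbo.ctr_corrupt + I(wbo.ctr_corrupt^2), data = d)
)

# Adjusted R-squared and number of observations per group
adj_r2 <- sapply(fits, function(m) summary(m)$adj.r.squared)
n_obs  <- sapply(fits, nobs)
print(round(adj_r2, 3))
print(n_obs)
```

Comparing adjusted R² rather than plain R² penalizes extra terms, which is useful when the candidate models differ in the number of explanatory variables.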

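The stochastic disturbance (error) term is normally distributed according to the Kolmogorov-Smirnov test (p-value 0.8748). The Shapiro-Wilk and Anderson-Darling tests also do not reject normality; only the Cramer-von Mises test does.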

## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9863         0.4304 
## Kolmogorov-Smirnov        0.0591         0.8748 
## Cramer-von Mises          25.482         0.0000 
## Anderson-Darling          0.4608         0.2546 
## -----------------------------------------------

Regression results visualization

A residual is the difference between the observed value and the mean value that the model predicts for that observation. The residuals of the second linear regression model are shown in the graph below as light grey straight lines. Observed values are shown as colored points. The redder and larger the point, the further it is from the predicted value; the blacker and smaller the point, the closer it is. The black hollow points that lie on the line are the mean values that the model predicts for each observation.
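The definition of a residual can be checked directly in R. This is a small sketch on simulated data with a hypothetical quadratic model `fit`, not the report's model.

```r
# Simulated data following a quadratic trend (hypothetical)
set.seed(6)
x <- rnorm(50)
y <- 2 + 0.5 * x - 0.3 * x^2 + rnorm(50, sd = 0.2)
fit <- lm(y ~ x + I(x^2))

predicted  <- fitted(fit)     # mean value the model predicts for each observation
res_manual <- y - predicted   # observed minus predicted

# The manual residuals match those returned by residuals()
print(all.equal(unname(res_manual), unname(residuals(fit))))
```

In a plot like the one described above, the length of each grey segment corresponds to the absolute value of `res_manual` for that observation.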

First of all, the line is curved because the relationship between control of corruption and the corruption perception index is quadratic, and the graph of a quadratic function is a parabola. The general equation of a parabola is y = ax^2 + bx + c. In our case it is: corruption perception index = -0.092 * (control of corruption)^2 - 0.048 * (control of corruption) + 0.860

where a = -0.092, b = -0.048, c = 0.860, x = control of corruption

In our case the parabola first goes up (while control of corruption < -0.26) and then goes down (once control of corruption > -0.26). This means that below -0.26 an increase in control of corruption is associated with an increase in the corruption perception index, whereas above -0.26 an increase in control of corruption is associated with a decrease in the corruption perception index.

How is the value -0.26 obtained? By setting the derivative of the fitted function to zero.

  1. Take the derivative of the function:

y' = (ax^2 + bx + c)'

y' = 2ax + b

or

(corruption perception index)' = (-0.092 * (control of corruption)^2 - 0.048 * (control of corruption) + 0.860)'

(corruption perception index)' = -0.184 * (control of corruption) - 0.048

  2. Find where the slope is zero:

-0.184 * (control of corruption) - 0.048 = 0

x = -0.048 / 0.184 = -0.26
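The same turning point can be verified numerically. The snippet below is a small check that uses the coefficients of model (2) as reported in the text; the variable names `a`, `b`, `c0`, `x_turn` and `y_turn` are chosen here for illustration.

```r
# Coefficients of model (2) as reported in the text
a  <- -0.092  # quadratic term
b  <- -0.048  # linear term
c0 <-  0.860  # intercept

# The slope of y = a*x^2 + b*x + c0 is 2*a*x + b, which is zero at x = -b / (2*a)
x_turn <- -b / (2 * a)
y_turn <- a * x_turn^2 + b * x_turn + c0

print(round(x_turn, 2))  # -0.26
```

Since `a` is negative, the turning point is a maximum: the fitted corruption perception index peaks at `x_turn` and declines on either side of it.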

Example from our graph:

The corruption perception index increases while control of corruption is below -0.26. Let's compare two countries:

  1. Uzbekistan (Control of Corruption -1.169; Corruption perception index 0.84)

  2. Mongolia (Control of Corruption -0.487; Corruption perception index 0.9)

In Uzbekistan, control of corruption is lower than in Mongolia (-1.169 < -0.487), and the corruption perception index is also lower (0.84 < 0.9).

The corruption perception index decreases once control of corruption is above -0.26. Let's compare two countries:

  1. Latvia (Control of Corruption 0.431; Corruption perception index 0.9)

  2. Austria (Control of Corruption 1.549; Corruption perception index 0.524)

In Latvia, control of corruption is lower than in Austria (0.431 < 1.549) and the corruption perception index is higher (0.9 > 0.524).
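To connect these four countries to the fitted curve, the sketch below evaluates the equation of model (2) at each country's control of corruption and compares the result with the observed index. The coefficients and country values are taken from the text; the helper `predict_cpi()` is a name introduced here for illustration.

```r
# Fitted curve of model (2), using the coefficients reported in the text
predict_cpi <- function(x) -0.092 * x^2 - 0.048 * x + 0.860

# Control of corruption and observed index for the four countries in the text
examples <- data.frame(
  country     = c("Uzbekistan", "Mongolia", "Latvia", "Austria"),
  ctr_corrupt = c(-1.169, -0.487, 0.431, 1.549),
  observed    = c(0.84, 0.90, 0.90, 0.524)
)
examples$predicted <- round(predict_cpi(examples$ctr_corrupt), 3)
examples$residual  <- round(examples$observed - examples$predicted, 3)
print(examples)
```

The predicted values rise from Uzbekistan to Mongolia (both to the left of the turning point) and fall from Latvia to Austria (both to the right of it), which matches the direction of the observed values.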

Regression model: Happiness score and control of corruption

The first model describes the relationship between happiness score and control of corruption using data on all 142 countries (developed and developing). The second model covers only developing countries and the third only developed countries. There is a positive correlation (0.7) between happiness score and control of corruption. According to the first model, an increase in control of corruption of 0.1 is associated with an increase in happiness score of 0.0765. Control of corruption is statistically significant (p-value below 0.01), and the model explains 48.7% of the variability in the response (adjusted R²), with residuals that pass the Kolmogorov-Smirnov normality test and relatively constant variance of the error terms.

The second model (developing countries) is not reliable: control of corruption is not statistically significant, and the adjusted R² is negative (-0.008).

According to the third model, an increase in control of corruption of 0.1 is associated with an increase in happiness score of 0.06. Control of corruption is statistically significant (p-value below 0.01), and the model explains 46.3% of the variability in the response (adjusted R²), with residuals that pass the Kolmogorov-Smirnov normality test and relatively constant variance of the error terms.

Dependent variable: score

                      (1)           (2)           (3)
wbo.ctr_corrupt       0.765***      0.201         0.600***
                      (0.066)       (0.250)       (0.066)
Constant              5.442***      4.546***      5.712***
                      (0.069)       (0.237)       (0.072)
Observations          142           47            95
R²                    0.490         0.014         0.469
Adjusted R²           0.487         -0.008        0.463
Residual Std. Error   0.817         0.840         0.670
                      (df = 140)    (df = 45)     (df = 93)
F Statistic           134.725***    0.651         82.108***
                      (df = 1; 140) (df = 1; 45)  (df = 1; 93)
Note: *p<0.1; **p<0.05; ***p<0.01
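The effect sizes quoted in the text follow from multiplying each slope by the size of the change in control of corruption. A short worked example, using the slope estimates from the table above (the vector names are chosen here for illustration):

```r
# Slope estimates for control of corruption taken from the table above
slopes <- c(all_countries = 0.765, developing = 0.201, developed = 0.600)

# Expected change in happiness score for a 0.1 increase in control of corruption
delta_x <- 0.1
effect  <- slopes * delta_x
print(effect)
```

The value for developing countries should not be interpreted, because that slope is not statistically significant.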

Heteroscedasticity check and Normality tests (142 countries)

## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.978          0.0217 
## Kolmogorov-Smirnov        0.0863         0.2403 
## Cramer-von Mises          9.778          0.0000 
## Anderson-Darling          0.7654         0.0456 
## -----------------------------------------------

Heteroscedasticity check and Normality tests (developed countries)

## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9531         0.0019 
## Kolmogorov-Smirnov        0.0817         0.5239 
## Cramer-von Mises          7.235          0.0000 
## Anderson-Darling          0.8893         0.0222 
## -----------------------------------------------

Regression results visualization

The residuals of the first linear regression model are shown in the graph below as light grey straight lines. Observed values are shown as colored points. The redder and larger the point, the further it is from the predicted value; the blacker and smaller the point, the closer it is. The black hollow points that lie on the line are the mean values that the model predicts for each observation.