GlobalCorruption (Project 2)

Author

Wilfried Bilong

Background

I’m using the Global Corruption data set for this project. This data set ranks countries by perceived corruption using what is called a CPI score. “The Corruption Perceptions Index (CPI) is the most widely used global corruption ranking in the world. It measures how corrupt each country’s public sector is perceived to be, according to experts and business people. Each country’s score is a combination of at least 3 data sources drawn from 13 different corruption surveys and assessments. These data sources are collected by a variety of reputable institutions, including the World Bank and the World Economic Forum.” (Transparency International, 2024) There are two main variables that tell us how corrupt a nation is perceived to be. The first is the CPI score, which is on a scale of 0 to 100. The higher the score, the cleaner (less corrupt) the country is perceived to be. The second is the rank, which places each country relative to the others in a given year, so a country’s rank can differ from year to year. Because the underlying sources are collected by institutions like the World Bank and the World Economic Forum, I consider the data highly reputable. I chose this topic because I had never heard of the CPI before. “Corruption” is something that people in many different countries talk about, but the fact that there was a science behind it piqued my interest. The goal is to take a look at the data behind much of the perceived corruption found in governments and, though it is outside the scope of this project, possibly to see how true these claims actually are.

Source: Transparency International. (2024, January 30). The ABCs of the CPI: How the Corruption Perceptions Index is calculated - News. Transparency.org. https://www.transparency.org/en/news/how-cpi-scores-are-calculated
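To make the link between score and rank concrete, here is a minimal sketch using the six 2020 scores that are printed in the Data section of this report. It assumes that tied scores share the best available rank (R's `ties.method = "min"`). That rule is my assumption rather than something stated on the Transparency International page, although it does reproduce the ranks shown for those six rows. The names `top6`, `cpi_2020` and `rank_2020` are only for illustration.

```r
# Six 2020 CPI scores copied from the first rows of the data set.
top6 <- data.frame(
  Country  = c("Denmark", "New Zealand", "Finland",
               "Singapore", "Sweden", "Switzerland"),
  cpi_2020 = c(88, 88, 85, 85, 85, 85)
)

# A higher score means a cleaner country, so rank on the negated score.
# ties.method = "min" is an assumption: tied countries share the best rank.
top6$rank_2020 <- rank(-top6$cpi_2020, ties.method = "min")

print(top6)
```

Under that assumption the two countries scoring 88 share rank 1 and the four countries scoring 85 share rank 3, so no country is given rank 2.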

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(readr)
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(RColorBrewer)
GlobalCorruption <- read_csv("GlobalCorruption.csv")
Rows: 180 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Country, ISO3, Region
dbl (31): CPI score 2020, Rank 2020, Sources 2020, Standard error 2020, CPI ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
setwd("/Users/wyembilong/Desktop/Data 110")

Data

head(GlobalCorruption)
# A tibble: 6 × 34
  Country     ISO3  Region `CPI score 2020` `Rank 2020` `Sources 2020`
  <chr>       <chr> <chr>             <dbl>       <dbl>          <dbl>
1 Denmark     DNK   WE/EU                88           1              8
2 New Zealand NZL   AP                   88           1              8
3 Finland     FIN   WE/EU                85           3              8
4 Singapore   SGP   AP                   85           3              9
5 Sweden      SWE   WE/EU                85           3              8
6 Switzerland CHE   WE/EU                85           3              7
# ℹ 28 more variables: `Standard error 2020` <dbl>, `CPI score 2019` <dbl>,
#   `Rank 2019` <dbl>, `Sources 2019` <dbl>, `Standard error 2019` <dbl>,
#   `CPI score 2018` <dbl>, `Rank 2018` <dbl>, `Sources 2018` <dbl>,
#   `Standard error 2018` <dbl>, `CPI score 2017` <dbl>, `Rank 2017` <dbl>,
#   `Sources 2017` <dbl>, `Standard error 2017` <dbl>, `CPI score 2016` <dbl>,
#   `Sources 2016` <dbl>, `Standard error 2016` <dbl>, `CPI score 2015` <dbl>,
#   `Sources 2015` <dbl>, `Standard error 2015` <dbl>, …
ggplot(GlobalCorruption, aes(x = `Region`, y = `CPI score 2020`)) + 
  theme_light(base_size = 12)

Graph_1 <- ggplot(GlobalCorruption, aes(x = `Region`, y = `CPI score 2020`)) + 
  theme_light(base_size = 12) + 
  labs(title = "2020 CPI Score by Region", caption = "Source: Transparency International", x = "Region") 
Graph_2 <- Graph_1 + 
  geom_bar(aes(x = `Region`, y = `CPI score 2020`, fill = `Rank 2020`), 
           position = "dodge", stat = "identity")
Graph_2

This is a simple graph comparing the 2020 CPI score across the different regions. The fill is the rank of the countries in those regions for that same year. We can see that Western Europe overwhelmingly had the highest rankings (closest to 1, i.e. darker blue) and Eastern Europe and Central Asia (ECA) had the lowest rankings (higher numbers, i.e. lighter blue). The bar graph also suggests that perceived corruption is higher in the south-eastern parts of the world.
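A numeric summary can back up what the bar graph shows. The sketch below averages the 2020 CPI score by region, but it only uses the twelve rows printed by the two `head()` calls in this report, not the full 180-row file. Those rows are the top scorers overall and the top scorers in SSA, so the averages are not the true regional averages. The names `cpi_sample` and `region_means` are mine.

```r
library(dplyr)

# Twelve rows copied from the head() outputs in this report (not the full data).
cpi_sample <- tibble(
  Country = c("Denmark", "New Zealand", "Finland", "Singapore",
              "Sweden", "Switzerland", "Seychelles", "Botswana",
              "Cabo Verde", "Rwanda", "Mauritius", "Namibia"),
  Region = c("WE/EU", "AP", "WE/EU", "AP", "WE/EU", "WE/EU",
             "SSA", "SSA", "SSA", "SSA", "SSA", "SSA"),
  `CPI score 2020` = c(88, 88, 85, 85, 85, 85, 66, 60, 58, 54, 53, 51)
)

# Average score and number of countries per region, highest average first.
region_means <- cpi_sample |>
  group_by(Region) |>
  summarise(mean_cpi = mean(`CPI score 2020`), countries = n()) |>
  arrange(desc(mean_cpi))

region_means
```

Running the same pipeline on `GlobalCorruption` instead of `cpi_sample` would give the averages for every region in the full data set.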

Linear Regression Analysis

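# Keep only the Sub-Saharan Africa (SSA) rows
SE_Region <- 
  filter(GlobalCorruption, Region == "SSA")

# Keep only the Eastern Europe and Central Asia (ECA) rows
SE_Region2 <- 
  filter(GlobalCorruption, Region == "ECA") 

# Stack the two regional subsets into one data set
SE_Combined <- rbind(SE_Region, SE_Region2)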

head(SE_Combined)
# A tibble: 6 × 34
  Country    ISO3  Region `CPI score 2020` `Rank 2020` `Sources 2020`
  <chr>      <chr> <chr>             <dbl>       <dbl>          <dbl>
1 Seychelles SYC   SSA                  66          27              4
2 Botswana   BWA   SSA                  60          35              7
3 Cabo Verde CPV   SSA                  58          41              4
4 Rwanda     RWA   SSA                  54          49              7
5 Mauritius  MUS   SSA                  53          52              6
6 Namibia    NAM   SSA                  51          57              7
# ℹ 28 more variables: `Standard error 2020` <dbl>, `CPI score 2019` <dbl>,
#   `Rank 2019` <dbl>, `Sources 2019` <dbl>, `Standard error 2019` <dbl>,
#   `CPI score 2018` <dbl>, `Rank 2018` <dbl>, `Sources 2018` <dbl>,
#   `Standard error 2018` <dbl>, `CPI score 2017` <dbl>, `Rank 2017` <dbl>,
#   `Sources 2017` <dbl>, `Standard error 2017` <dbl>, `CPI score 2016` <dbl>,
#   `Sources 2016` <dbl>, `Standard error 2016` <dbl>, `CPI score 2015` <dbl>,
#   `Sources 2015` <dbl>, `Standard error 2015` <dbl>, …

The above is a new data set that I created using only countries in Sub-Saharan Africa (SSA) and Eastern Europe and Central Asia (ECA). I started by filtering the regions individually: “SE_Region” for SSA and “SE_Region2” for ECA. I then combined the two data sets into “SE_Combined” using the “rbind” function. I’ll use this first for my linear regression.
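The same subset could probably be built in one step with `%in%`. This is only a sketch of that alternative, run on the twelve rows printed by the `head()` calls in this report rather than on the full file, so no ECA rows are present and only the six SSA rows come back. The names `cpi_sample`, `regions_of_interest` and `se_combined_alt` are mine.

```r
library(dplyr)

# Twelve rows copied from the head() outputs in this report (not the full data).
cpi_sample <- tibble(
  Country = c("Denmark", "New Zealand", "Finland", "Singapore",
              "Sweden", "Switzerland", "Seychelles", "Botswana",
              "Cabo Verde", "Rwanda", "Mauritius", "Namibia"),
  Region = c("WE/EU", "AP", "WE/EU", "AP", "WE/EU", "WE/EU",
             "SSA", "SSA", "SSA", "SSA", "SSA", "SSA")
)

# Keep every row whose Region is one of the listed codes.
regions_of_interest <- c("SSA", "ECA")
se_combined_alt <- cpi_sample |>
  filter(Region %in% regions_of_interest)

se_combined_alt
```

One difference from the two-step version: this keeps the rows in their original order instead of listing all SSA rows first and all ECA rows second.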

PLin_Reg <- 
  ggplot(SE_Combined, aes(x = `CPI score 2020`, y = `Standard error 2020`)) + 
  theme_light(base_size = 12) + 
  labs(title = "CPI Score vs Standard Error in 2020", x = "CPI Score", y = "Standard Error") + 
  geom_point(aes(color = `Region`))
PLin_Reg

This is a scatter plot of the CPI scores for these south-eastern regions compared against the standard error of those scores.

PLin_Reg2 <- 
  PLin_Reg + geom_smooth(color = "red")
PLin_Reg2
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Between the two graphs we can see a roughly linear, though weak, relationship between the CPI score and the standard error. Note that the red curve is a loess smoother, which geom_smooth() chose by default here, not a fitted straight line. To me this suggests that on average the scores can be trusted and that the margins of error are rarely large enough to matter. This is a simple check of how reliable the data is. The association between these two variables is weak, but given the kind of variable standard error is, I take that as a good sign.
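Since the smoother drawn above is a loess curve, a straight-line fit would need `method = "lm"`. The sketch below shows that option on simulated stand-in data, because the individual standard error values are not printed in this report. None of the points are real CPI values, and the names `sim` and `linear_plot` are mine.

```r
library(ggplot2)

set.seed(110)  # arbitrary seed so the simulated points are reproducible

# Simulated stand-in data: these are NOT values from the CPI data set.
sim <- data.frame(cpi = runif(68, min = 10, max = 70))
sim$std_error <- 1.5 + 0.02 * sim$cpi + rnorm(68, sd = 1.2)

# method = "lm" draws a least-squares line instead of the default loess curve.
linear_plot <- ggplot(sim, aes(x = cpi, y = std_error)) +
  theme_light(base_size = 12) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, color = "red") +
  labs(title = "Simulated example: linear smoother",
       x = "CPI Score", y = "Standard Error")

linear_plot
```

On the real data the equivalent change would be `PLin_Reg + geom_smooth(method = "lm", color = "red")`, which would draw the same line that `lm()` estimates.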

cor(SE_Combined$`CPI score 2020`,SE_Combined$`Standard error 2020`)
[1] 0.1953951

This is the correlation coefficient that I calculated using the “cor” function. The fact that the number is so much closer to 0 than to 1 shows what I mean about how weak the association between these variables is.

Equation <- lm(`Standard error 2020` ~ `CPI score 2020`, data = SE_Combined)
summary(Equation)

Call:
lm(formula = `Standard error 2020` ~ `CPI score 2020`, data = SE_Combined)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5444 -0.8442 -0.4434  0.6056  3.8272 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)   
(Intercept)       1.50050    0.43877   3.420  0.00108 **
`CPI score 2020`  0.02011    0.01242   1.619  0.11030   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.199 on 66 degrees of freedom
Multiple R-squared:  0.03818,   Adjusted R-squared:  0.02361 
F-statistic:  2.62 on 1 and 66 DF,  p-value: 0.1103

In this case SE means standard error and CPI means CPI Score.

SE = 0.020(CPI) + 1.501
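As a quick worked example of what the equation implies, the sketch below plugs two CPI scores into it, using the unrounded coefficients from the `summary(Equation)` output. The helper `predict_se` and the scores 30 and 60 are hypothetical choices of mine, picked only to show how little the predicted standard error changes.

```r
# Coefficients copied from the summary(Equation) output above.
intercept <- 1.50050
slope     <- 0.02011

# Hypothetical helper: predicted standard error for a given CPI score.
predict_se <- function(cpi) intercept + slope * cpi

predicted <- predict_se(c(30, 60))
predicted

# Change in predicted standard error across a 30-point gap in CPI score.
diff(predicted)
```

A 30-point difference in CPI score moves the predicted standard error by only about 0.6, which is small next to the residual standard error of 1.199 reported by the model.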

As I stated before, the correlation between these two variables is very weak, and I’m glad that is the case. The slope is small (0.020) and not statistically significant (p = 0.11), and the model explains only about 4% of the variation in standard error (R-squared = 0.038). So there is not much of a connection between a country’s CPI score and the standard error of that score. To me that suggests that things like tampering and forgery aren’t showing up in the data. In other words, the standard error looks random enough for the CPI scores themselves to be trustworthy.
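The correlation from `cor()` and the regression output from `summary()` describe the same relationship, and that can be checked with a few lines of arithmetic. In a regression with one predictor, R-squared is the square of the correlation, and the slope's t value can be computed from the correlation and the residual degrees of freedom. The sketch below uses only the numbers printed above. The variable names are mine.

```r
r        <- 0.1953951  # correlation reported by cor()
resid_df <- 66         # residual degrees of freedom reported by summary()

# R-squared is the squared correlation when there is a single predictor.
r_squared <- r^2

# t statistic for the slope, and its two-sided p-value.
t_stat  <- r * sqrt(resid_df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = resid_df)

round(c(r_squared = r_squared, t = t_stat, p = p_value), 4)
```

These should line up with the Multiple R-squared (0.03818), the t value (1.619) and the p-value (0.1103) in the model summary, which confirms that the two outputs are telling the same story.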

Data Visualization

highchart() |>
  hc_add_series(data = SE_Combined, 
                type = "line", hcaes(x = `CPI score 2020`, 
                y = `Rank 2020`, 
                group = `Country`))

A basic graph of CPI score versus rank for countries in the two south-eastern regions (SSA and ECA).

color <- brewer.pal(4, "Set2")

highchart() |>
  hc_add_series(data = SE_Combined, 
                type = "line", hcaes(x = `CPI score 2020`, 
                y = `Rank 2020`, 
                group = `Country`)) |>
  hc_colors(color)

Adding color

highchart() |>
  hc_add_series(data = SE_Combined, 
                type = "line", hcaes(x = `CPI score 2020`, 
                y = `Rank 2020`, 
                group = `Sources 2020`)) |>
  hc_colors(color) |> 
  hc_xAxis(title = list(text = "CPI Score")) |>
  hc_yAxis(title = list(text = "Rank"))

Adding x- and y-axis titles, and grouping by the number of sources instead of by country.

Final_Plot <-
highchart() |>
  hc_add_series(data = SE_Combined, 
                type = "line", hcaes(x = `Country`, 
                y = `Rank 2020`, 
                group = `Sources 2020`, shape = `Country`)) |>
  hc_colors(color) |> 
  hc_xAxis(title = list(text = "Country")) |>
  hc_yAxis(title = list(text = "Rank")) |> 
  hc_legend(align = "left", 
            verticalAlign = "top")

  
Final_Plot

In this graph I wanted to see if the number of sources used in compiling a score had any effect on a country’s ranking. Surprisingly, the more sources a country had, the more corrupt it turned out to be. This could mean a variety of things, but for the most part it seems as though the institutions that look into corruption take their time when it comes to countries in these south-eastern regions.
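The same question can be asked with a small table instead of a chart. The sketch below groups countries by their number of sources and averages their rank, using only the six SSA rows printed by `head(SE_Combined)` earlier in this report. Six rows are far too few to draw a conclusion, so this only shows the method. The names `ssa_sample` and `rank_by_sources` are mine.

```r
library(dplyr)

# Six SSA rows copied from the head(SE_Combined) output (not the full data).
ssa_sample <- tibble(
  Country = c("Seychelles", "Botswana", "Cabo Verde",
              "Rwanda", "Mauritius", "Namibia"),
  `Rank 2020`    = c(27, 35, 41, 49, 52, 57),
  `Sources 2020` = c(4, 7, 4, 7, 6, 7)
)

# Average rank for each number of sources (a larger rank means more corrupt).
rank_by_sources <- ssa_sample |>
  group_by(`Sources 2020`) |>
  summarise(mean_rank = mean(`Rank 2020`), countries = n())

rank_by_sources

# Spearman correlation: a positive value means more sources goes with a worse rank.
cor(ssa_sample$`Sources 2020`, ssa_sample$`Rank 2020`, method = "spearman")
```

Swapping `ssa_sample` for `SE_Combined` would run the same summary on all 68 countries in the two regions.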

Conclusion

To conclude, I found this data set very interesting. It seemed to me that countries in the south-eastern parts of the world, including those in Africa and parts of Europe and Asia, were always the ones called out for their corruption. I felt it was important to take a closer look at them, since those countries are sometimes left with incomplete data for a variety of reasons. I honestly expected to see signs of corruption (no pun intended) even within the data used for these countries. It’s possible that someone could favor countries in the West, but this doesn’t seem to be the case. The assessments used to collect this data seem to be more prevalent in countries ranked outside the top 50 than in those ranked inside it.