I’m using the Global Corruption data set for this project. This data set ranks countries by corruption using what Transparency International calls a CPI score. “The Corruption Perceptions Index (CPI) is the most widely used global corruption ranking in the world. It measures how corrupt each country’s public sector is perceived to be, according to experts and business people. Each country’s score is a combination of at least 3 data sources drawn from 13 different corruption surveys and assessments. These data sources are collected by a variety of reputable institutions, including the World Bank and the World Economic Forum.” (Transparency International, 2024) Two main variables tell us how corrupt a nation is perceived to be. The first is the CPI score, which is on a scale of 0 to 100: the higher the number, the cleaner or less corrupt the country is perceived to be. The second is the rank, which places each country relative to the others; ranks can change from year to year. Because the underlying surveys and assessments come from institutions like the World Bank and the World Economic Forum, I consider the data highly reputable. I chose this topic because I had never heard of the index before. “Corruption” is something people in many countries talk about, but the fact that there was a science behind measuring it piqued my interest. The goal is to look at the data behind much of the perceived corruption found in governments and, though it is outside the scope of this project, possibly see how true these claims actually are.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(readr)
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Rows: 180 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, ISO3, Region
dbl (31): CPI score 2020, Rank 2020, Sources 2020, Standard error 2020, CPI ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(GlobalCorruption, aes(x = `Region`, y = `CPI score 2020`)) +
  theme_light(base_size = 12)
# Base plot: region on the x axis, 2020 CPI score on the y axis
Graph_1 <- ggplot(GlobalCorruption, aes(x = `Region`, y = `CPI score 2020`)) +
  theme_light(base_size = 12) +
  labs(title = "2020 CPI Score by Region", caption = "Source: Transparency International", x = "Region")

# Add bars for the CPI scores, filled by each country's 2020 rank
Graph_2 <- Graph_1 +
  geom_bar(aes(x = `Region`, y = `CPI score 2020`, fill = `Rank 2020`), position = "dodge", stat = "identity")
Graph_2
This is a simple graph comparing 2020 CPI scores across regions. The fill shows each country's rank for the same year. Western Europe overwhelmingly had the highest rankings (closest to 1, i.e. darker blue), and Eastern Europe and Central Asia (ECA) had the lowest rankings (higher numbers, i.e. lighter blue). From this bar graph I can also infer an association between perceived corruption and countries in the south-eastern parts of the world.
SE_Combined is a new data set that I created using only countries in Sub-Saharan Africa (SSA) and Eastern Europe and Central Asia (ECA). I started by filtering each region individually, into “SE_Region” for SSA and “SE_Region2” for ECA. I then merged the two data sets into “SE_Combined” using the “rbind” function. I’ll use this first for my linear regression.
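The filtering and merging steps described here could look roughly like the sketch below. It assumes the Region column holds the codes "SSA" and "ECA", and it uses a small made-up data frame in place of the real GlobalCorruption data so that it runs on its own; the alternative one-step version at the end is an illustration, not what was run for this project.

```r
library(dplyr)

# Toy stand-in for GlobalCorruption (hypothetical rows, not real CPI values)
GlobalCorruption <- data.frame(
  Country = c("A", "B", "C", "D", "E"),
  Region = c("SSA", "ECA", "WE/EU", "SSA", "ECA"),
  `CPI score 2020` = c(30, 35, 80, 25, 40),
  check.names = FALSE
)

# Filter each region on its own, then stack the two results row-wise
SE_Region   <- filter(GlobalCorruption, Region == "SSA")
SE_Region2  <- filter(GlobalCorruption, Region == "ECA")
SE_Combined <- rbind(SE_Region, SE_Region2)

# The same set of rows in one step, using %in% instead of two filters
SE_Combined_alt <- filter(GlobalCorruption, Region %in% c("SSA", "ECA"))
```

Both versions keep the same countries; the only difference is row order, since rbind places all SSA rows before the ECA rows.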
PLin_Reg <- ggplot(SE_Combined, aes(x = `CPI score 2020`, y = `Standard error 2020`)) +
  theme_light(base_size = 12) +
  labs(title = "CPI Score vs Standard Error in 2020", x = "CPI Score", y = "Standard Error") +
  geom_point(aes(color = `Region`))
PLin_Reg
This is a scatter plot of the CPI scores for these south-eastern regions compared against the standard error of those scores.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Between the two graphs we can see a somewhat linear, though weak, relationship between the CPI score and the standard error. To me this suggests that, on average, the scores can be trusted and the margins of error are rarely high enough to need taking into consideration. This is a simple check of how reliable the data is. In general the association between these two variables is weak, but given the kind of variable standard error is, to me that’s actually a good sign.
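The code for the smoothed version of the scatter plot is not shown, but the `geom_smooth()` message suggests the default loess smoother was used. A minimal sketch of how a smoother could be layered onto the scatter plot, using made-up values in place of SE_Combined, with a straight-line alternative for comparison:

```r
library(ggplot2)

# Toy stand-in for SE_Combined (hypothetical values, not real CPI data)
SE_Combined <- data.frame(
  Region = rep(c("SSA", "ECA"), each = 6),
  `CPI score 2020` = c(12, 19, 25, 30, 38, 45, 21, 27, 33, 36, 47, 56),
  `Standard error 2020` = c(1.2, 2.9, 1.6, 3.4, 1.9, 2.8, 2.1, 1.4, 3.0, 2.2, 1.8, 3.1),
  check.names = FALSE
)

PLin_Reg <- ggplot(SE_Combined, aes(x = `CPI score 2020`, y = `Standard error 2020`)) +
  theme_light(base_size = 12) +
  geom_point(aes(color = Region))

# Default smoother (loess for small samples): follows local bends in the data
P_Loess <- PLin_Reg + geom_smooth()

# Straight least-squares line, the same kind of fit that lm() produces
P_Linear <- PLin_Reg + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Printing P_Loess or P_Linear draws the plot. The loess curve can bend even when there is no overall trend, so the straight line is often the easier one to compare against a regression equation.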
This is the correlation coefficient that I calculated using the “cor” function. The fact that the number is so much closer to 0 than to 1 shows what I mean about how weak the association between these variables is.
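The call itself is not shown, so the following is only a sketch of how the coefficient could be computed, using made-up vectors rather than the real columns. It also illustrates a useful check: for a regression with one predictor, the squared correlation equals the model's R-squared. Applied to the regression output in this report (Multiple R-squared: 0.03818), that implies a correlation of roughly 0.195 in absolute value.

```r
# Toy stand-ins for the two columns (hypothetical values, not real CPI data)
cpi <- c(12, 19, 25, 30, 38, 45, 21, 27, 33, 36, 47, 56)
se  <- c(1.2, 2.9, 1.6, 3.4, 1.9, 2.8, 2.1, 1.4, 3.0, 2.2, 1.8, 3.1)

# Pearson correlation, ignoring any rows with missing values
r <- cor(cpi, se, use = "complete.obs")

# With a single predictor, r squared matches the R-squared from lm()
fit <- lm(se ~ cpi)
r_squared <- summary(fit)$r.squared

# Correlation magnitude implied by the R-squared reported in this analysis
implied_r <- sqrt(0.03818)
```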
Equation <- lm(`Standard error 2020` ~ `CPI score 2020`, data = SE_Combined)
summary(Equation)
Call:
lm(formula = `Standard error 2020` ~ `CPI score 2020`, data = SE_Combined)
Residuals:
Min 1Q Median 3Q Max
-1.5444 -0.8442 -0.4434 0.6056 3.8272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.50050 0.43877 3.420 0.00108 **
`CPI score 2020` 0.02011 0.01242 1.619 0.11030
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.199 on 66 degrees of freedom
Multiple R-squared: 0.03818, Adjusted R-squared: 0.02361
F-statistic: 2.62 on 1 and 66 DF, p-value: 0.1103
In this case SE means standard error and CPI means CPI Score.
SE = 0.020(CPI) + 1.501
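To make the equation concrete, here is a small worked example that plugs a few CPI scores into the fitted line. The coefficients are copied from the regression output above; the helper name predict_se and the chosen scores are only for illustration.

```r
# Coefficients taken from the summary(Equation) output
intercept <- 1.50050
slope     <- 0.02011

# Hypothetical helper: predicted standard error for a given CPI score
predict_se <- function(cpi) intercept + slope * cpi

predicted <- predict_se(c(20, 30, 50))
predicted
```

A 30-point rise in CPI score (from 20 to 50) moves the predicted standard error from about 1.90 to about 2.51, which is small next to the residual standard error of 1.199 reported by the model.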
As I stated before, the correlation between these two is extremely weak, but I’m glad that is the case. The slope is small and not statistically significant (p = 0.11), and the model explains only about 4% of the variation in standard error (R-squared = 0.038). Combined with what we already knew, there is not much of a connection between the CPI score and the standard error used in calculating it. To me that means things like tampering and forgery aren’t showing up in the data; in other words, the standard error is random enough for the CPI scores themselves to seem trustworthy.
Data Visualization
highchart() |>
  hc_add_series(data = SE_Combined, type = "line", hcaes(x = `CPI score 2020`, y = `Rank 2020`, group = `Country`))
Basic graph of the CPI Score vs rank of countries in the South East region.
library(RColorBrewer) # provides brewer.pal()

color <- brewer.pal(4, "Set2")
highchart() |>
  hc_add_series(data = SE_Combined, type = "line", hcaes(x = `CPI score 2020`, y = `Rank 2020`, group = `Country`)) |>
  hc_colors(color)
Adding color
highchart() |>
  hc_add_series(data = SE_Combined, type = "line", hcaes(x = `CPI score 2020`, y = `Rank 2020`, group = `Sources 2020`)) |>
  hc_colors(color) |>
  hc_xAxis(title = list(text = "CPI Score")) |>
  hc_yAxis(title = list(text = "Rank"))
Adding x and y Axes.
Final_Plot <- highchart() |>
  hc_add_series(data = SE_Combined, type = "line", hcaes(x = `Country`, y = `Rank 2020`, group = `Sources 2020`, shape = `Country`)) |>
  hc_colors(color) |>
  hc_xAxis(title = list(text = "Country")) |>
  hc_yAxis(title = list(text = "Rank")) |>
  hc_legend(align = "left", verticalAlign = "top")
Final_Plot
In this graph I wanted to see whether the number of sources used to compile a score had any effect on a country's ranking. Surprisingly, the more sources a country had, the more corrupt it turned out to be. This could mean a variety of things, but for the most part it seems as though the organizations that look into corruption take their time when it comes to countries in these south-eastern regions.
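One way to back up what the plot suggests is to summarise rank by number of sources. The sketch below was not part of the original analysis and uses made-up rows in place of SE_Combined. It computes the mean rank for each source count and a Spearman rank correlation, which suits ranks better than a Pearson correlation.

```r
library(dplyr)

# Toy stand-in for SE_Combined (hypothetical values, not real CPI data)
SE_Combined <- data.frame(
  Country = paste0("Country_", 1:8),
  `Sources 2020` = c(3, 4, 5, 6, 7, 8, 8, 9),
  `Rank 2020` = c(60, 94, 78, 117, 129, 149, 134, 160),
  check.names = FALSE
)

# Mean rank and number of countries for each source count
rank_by_sources <- SE_Combined |>
  group_by(`Sources 2020`) |>
  summarise(countries = n(), mean_rank = mean(`Rank 2020`), .groups = "drop")

# Spearman correlation: positive values mean more sources go with worse ranks
rho <- cor(SE_Combined$`Sources 2020`, SE_Combined$`Rank 2020`, method = "spearman")
```

On the real data, a clearly positive rho would support the pattern described above, while a value near zero would suggest the pattern in the chart is mostly visual.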
Conclusion
To conclude, I found this data set very interesting. It seemed to me that countries in the south-eastern parts of the world, including those in Africa and parts of Europe and Asia, were always the ones called out for their corruption. I felt it was important to take a closer look at them, since those countries are sometimes left with incomplete data for a variety of reasons. I honestly expected to see signs of corruption (no pun intended) even within the data used for these countries. It’s possible that someone could favor countries in the West, but this doesn’t seem to be the case. The assessments used to collect this data seem to be more prevalent in countries with lower ranks (rank 50 and beyond) than in those with the highest ranks (inside the top 50).