In this work, we use
the Corruption Perceptions Index (CPI) 2020: Score timeseries since 2012
to examine the level of corporate and political corruption in
different regions over the years. This index is an annual data set compiled
and published by Transparency International, a global non-governmental
organization that monitors and publicizes corporate and political
corruption. The data can be found under this link.
The data set provides scores on a scale of 0 to 100, with higher scores indicating lower levels of perceived corruption, for 180 countries and territories around the world. The scores are based on a combination of surveys and assessments of corruption in the public sector, carried out by experts and business executives.
The CPI is widely used by academics, policymakers, and journalists as a measure of corruption levels in different countries, and to monitor trends in corruption over time. The 2020 data set is the most recent available, and includes scores for each country and territory for every year from 2012 to 2020, allowing for longitudinal analysis of trends in perceived corruption.
The variables in the data set include:
Country: Name of Country
ISO3: a three-letter country code
Region: region codes, in particular:
CPI score YYYY: Corruption Perceptions Index in year YYYY, for the years 2012 to 2020. Scores range from 0 to 100; a higher score means a lower level of perceived corruption.
Rank YYYY: the rank of the country based on its CPI score in year YYYY.
Sources YYYY: the data source from year YYYY
Standard error YYYY: the standard error of the CPI score in year YYYY.
In total, there are 180 rows and 34 columns in the data set. The data set is quite clean once the first two rows of the csv file are skipped. However, we make some modifications to the data structure, which are described in the following section.
First, we load the required libraries, set the default theme for ggplot visualizations, and read in the data.
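# tidyverse provides readr, dplyr, tidyr, stringr and ggplot2; pander formats tables
library(tidyverse)
library(pander)
# Default ggplot theme for all figures
theme_set(theme_minimal(base_size = 11,
                        base_family = "mono"))
# Skip the first two rows of the csv file, which are not part of the table
df <- read_csv("GlobalCorruption.csv", skip = 2)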
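The effect of `skip = 2` can be illustrated on a small inline example. This is only a sketch: the two preamble lines, the country names, and the scores below are made up for illustration and are not taken from the real file. Wrapping the string in `I()` (an assumption that readr 2.0 or later is installed) tells `read_csv` to treat it as literal data rather than as a file path.

```r
library(tidyverse)

# Hypothetical miniature of the raw file: two preamble lines, then the header row.
raw_text <- paste(
  "Toy preamble line one",
  "Toy preamble line two",
  "Country,ISO3,Region,CPI score 2020,CPI score 2019",
  "Country A,AAA,WE/EU,80,79",
  "Country B,BBB,WE/EU,40,41",
  sep = "\n"
)

# skip = 2 drops the preamble so that the third line is used as the header.
toy_df <- read_csv(I(raw_text), skip = 2, show_col_types = FALSE)

colnames(toy_df)
dim(toy_df)
```

Without `skip = 2`, the first preamble line would be read as the header and the column names would be wrong.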
Our question concerns the level of corporate and political corruption in different regions over the years. Hence, we keep only the required columns, that is, Region and the CPI score for each year.
# Keep Region and every column whose name contains "CPI"
cpi <- df %>%
  select(Region, contains("CPI"))
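The column selection can be checked on a toy table. The sketch below compares position-based selection via `grepl()` with the tidyselect helper `contains()`. The table and its values are hypothetical; only the column naming pattern follows the data set described above.

```r
library(tidyverse)

# Hypothetical wide table with the same naming pattern as the CPI file.
toy_df <- tibble(
  Country = c("Country A", "Country B"),
  Region = c("WE/EU", "WE/EU"),
  `CPI score 2020` = c(80, 40),
  `Rank 2020` = c(1, 2),
  `CPI score 2019` = c(79, 41)
)

# Selection by column position, found with a regular expression.
by_position <- toy_df %>%
  select(Region, which(grepl("CPI", colnames(toy_df))))

# Selection with the tidyselect helper (case-insensitive by default).
by_helper <- toy_df %>%
  select(Region, contains("CPI"))

colnames(by_position)
colnames(by_helper)
```

Both variants keep Region and the two CPI score columns and drop Country and Rank 2020. The helper avoids referring to `df` a second time inside the pipe.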
We then pivot the table to long format. The result has only
three columns: one for the region, one for the year (YYYY,
extracted from the original column names CPI score YYYY), and one for the
CPI score. The first rows are shown below:
cpi <- cpi %>%
  pivot_longer(cols = -c(Region)) %>%
  # Extract the year from column names such as "CPI score 2020"
  mutate(years = as.integer(str_extract(name, '[0-9]+'))) %>%
  rename(CPI = value) %>%
  select(-name)
cpi %>%
head(3) %>%
pander(caption = "sample rows")
| Region | CPI | years |
|---|---|---|
| WE/EU | 88 | 2020 |
| WE/EU | 87 | 2019 |
| WE/EU | 88 | 2018 |
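As an alternative sketch, the reshaping and the year extraction can be done in a single `pivot_longer()` call using its `names_pattern` and `names_transform` arguments (this assumes tidyr 1.1.0 or later). The toy table below is hypothetical, and the resulting column order (Region, years, CPI) differs from the table above.

```r
library(tidyverse)

# Hypothetical wide table: one row per country, one CPI column per year.
toy_wide <- tibble(
  Region = c("WE/EU", "WE/EU"),
  `CPI score 2020` = c(80, 40),
  `CPI score 2019` = c(79, NA)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = -Region,
    names_to = "years",
    # The capture group keeps only the four-digit year from each column name.
    names_pattern = "CPI score (\\d{4})",
    # Convert the extracted year from character to integer.
    names_transform = list(years = as.integer),
    values_to = "CPI"
  )

toy_long
```

Each row of the wide table contributes one long row per year, so two rows and two year columns give four rows.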
Next, we create a helper column holding the average CPI score of each region. We use this column to reorder the x axis in the visualization so that the differences between regions are clearer.
cpi <- cpi %>%
  group_by(Region) %>%
  # Regional mean over all countries and years, ignoring missing scores
  mutate(avg_cpi = mean(CPI, na.rm = TRUE)) %>%
  ungroup()
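The ordering that results from this helper column can be inspected on a toy example. The sketch below uses hypothetical regions R1 and R2 and made-up scores. It also shows that `reorder()` can compute the regional mean itself through its `FUN` argument, which would make the helper column optional.

```r
library(tidyverse)

# Hypothetical long table with one missing score.
toy_long <- tibble(
  Region = c("R1", "R1", "R2", "R2"),
  CPI = c(80, 90, 30, NA)
)

# Same helper column as in the text: the regional mean repeated on every row.
toy_long <- toy_long %>%
  group_by(Region) %>%
  mutate(avg_cpi = mean(CPI, na.rm = TRUE)) %>%
  ungroup()

# Ordering via the helper column.
via_helper <- levels(reorder(toy_long$Region, toy_long$avg_cpi))

# Ordering computed directly from CPI; na.rm = TRUE is passed on to mean().
direct <- levels(reorder(toy_long$Region, toy_long$CPI, FUN = mean, na.rm = TRUE))

via_helper
direct
```

Both approaches place the region with the lower average score first, which is the left-to-right order on the x axis.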
Finally, we create our visualization:
cpi %>%
  # Order regions on the x axis by their average CPI score
  ggplot(aes(reorder(Region, avg_cpi), CPI, fill = as.factor(years))) +
  geom_boxplot() +
  theme(legend.position = "bottom",
        plot.title = element_text(size = 13, family = "serif")) +
  guides(fill = guide_legend(nrow = 1)) +
  labs(fill = 'year', x = NULL)
Distribution of CPI over Time by Regions
From the visualization, it is obvious that:
We analyze the data at the region level. The visualization above shows outliers, which suggests that the analysis should also be carried out at the country level. Moreover, we focus only on the CPI score; future work could also examine the rank of countries based on their CPI scores.
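A possible starting point for such a country-level follow-up is sketched below. It is not part of the analysis above. It assumes that the Country column is kept during reshaping, and it flags scores that lie more than 1.5 times the interquartile range outside the quartiles within each region and year. The toy data are hypothetical, and `quantile()` may differ slightly from the hinges that `geom_boxplot()` uses.

```r
library(tidyverse)

# Hypothetical country-level scores for one region and one year.
toy_country <- tibble(
  Country = c("A", "B", "C", "D", "E"),
  Region = "R1",
  years = 2020L,
  CPI = c(30, 32, 34, 36, 85)
)

flagged <- toy_country %>%
  group_by(Region, years) %>%
  mutate(
    q1 = quantile(CPI, 0.25, na.rm = TRUE),
    q3 = quantile(CPI, 0.75, na.rm = TRUE),
    # Flag scores beyond 1.5 * IQR from the quartiles.
    is_outlier = CPI < q1 - 1.5 * (q3 - q1) | CPI > q3 + 1.5 * (q3 - q1)
  ) %>%
  ungroup()

flagged %>%
  filter(is_outlier) %>%
  select(Country, Region, years, CPI)
```

In this toy example only country E is flagged, since its score of 85 lies above the upper fence of 42.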