Project 1 Global Corruption

Author

Doria Shima

Introduction

The dataset I chose covers global corruption around the world. It has 34 variables, including the Corruption Perceptions Index (CPI) score, standard error, rank, and sources, covering 2012 to 2020.

For the analysis below, my main question is whether the CPI from previous years helps predict the CPI in 2020. The CPI is the most widely used global corruption ranking and measures how corrupt a country’s public sector is perceived to be.

The source of my dataset is Transparency International, whose mission is to stop corruption and promote transparency, accountability and integrity at all levels and across all sectors of society. https://www.transparency.org/en/about

Load the library packages

The first step for this analysis was to load all the packages needed. To do this, I created a chunk that calls library() for each package and ran it. Packages are the fundamental units of reproducible R code: they bundle reusable functions, documentation that describes how to use them, and sample data.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(zoo)

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
library(ggfortify)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(GGally)
library(htmltools)
library(alluvial)
library(ggalluvial)
# The dataset is read from a CSV file in a later step, so no data() call is needed here.
# source: Transparency International

Find my working directory

The second step was to find my working directory by running getwd() in a chunk.

getwd()
[1] "/Users/doriashima/Desktop/data visualization course"

Load the dataset

The third step was to load the dataset from my working directory using setwd() and read_csv(), assigning the result to globalcorruptiondataset with <-. After that, I used head() to see the first few rows of the dataset and its columns. The dataset then appeared in the global environment, where I could inspect it.

setwd("/Users/doriashima/Desktop/data visualization course")
globalcorruptiondataset <- read_csv("GlobalCorruption.csv",skip = 2)
Rows: 180 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Country, ISO3, Region
dbl (31): CPI score 2020, Rank 2020, Sources 2020, Standard error 2020, CPI ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(globalcorruptiondataset)
# A tibble: 6 × 34
  Country     ISO3  Region `CPI score 2020` `Rank 2020` `Sources 2020`
  <chr>       <chr> <chr>             <dbl>       <dbl>          <dbl>
1 Denmark     DNK   WE/EU                88           1              8
2 New Zealand NZL   AP                   88           1              8
3 Finland     FIN   WE/EU                85           3              8
4 Singapore   SGP   AP                   85           3              9
5 Sweden      SWE   WE/EU                85           3              8
6 Switzerland CHE   WE/EU                85           3              7
# ℹ 28 more variables: `Standard error 2020` <dbl>, `CPI score 2019` <dbl>,
#   `Rank 2019` <dbl>, `Sources 2019` <dbl>, `Standard error 2019` <dbl>,
#   `CPI score 2018` <dbl>, `Rank 2018` <dbl>, `Sources 2018` <dbl>,
#   `Standard error 2018` <dbl>, `CPI score 2017` <dbl>, `Rank 2017` <dbl>,
#   `Sources 2017` <dbl>, `Standard error 2017` <dbl>, `CPI score 2016` <dbl>,
#   `Sources 2016` <dbl>, `Standard error 2016` <dbl>, `CPI score 2015` <dbl>,
#   `Sources 2015` <dbl>, `Standard error 2015` <dbl>, …
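
To illustrate what skip = 2 does when loading, here is a minimal sketch. It assumes the file starts with two non-data lines before the header row; the two preamble lines in csv_text are hypothetical, and base R's read.csv() stands in for read_csv() so the sketch needs no packages.

```r
# Hypothetical file contents: two preamble lines, then the header and data rows
csv_text <- "Hypothetical title line
Hypothetical notes line
Country,ISO3,CPI score 2020
Denmark,DNK,88
Finland,FIN,85
"

# skip = 2 drops the first two lines so the third line becomes the header;
# check.names = FALSE keeps the spaces in the column names as they are
demo <- read.csv(text = csv_text, skip = 2, check.names = FALSE)

print(names(demo))
print(nrow(demo))
```

Without skip = 2, the first preamble line would be treated as the header and the column names would be wrong.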

Clean the data

My fourth step was to clean the data by setting all the variable names to lowercase, so that I would not have to keep track of capitalization. I also replaced the spaces in variable names with underscores. I ran head() again to look at the changes made.

names(globalcorruptiondataset) <- tolower(names(globalcorruptiondataset))
names(globalcorruptiondataset) <- gsub(" ","_",names(globalcorruptiondataset))
head(globalcorruptiondataset)
# A tibble: 6 × 34
  country iso3  region cpi_score_2020 rank_2020 sources_2020 standard_error_2020
  <chr>   <chr> <chr>           <dbl>     <dbl>        <dbl>               <dbl>
1 Denmark DNK   WE/EU              88         1            8                1.78
2 New Ze… NZL   AP                 88         1            8                1.48
3 Finland FIN   WE/EU              85         3            8                1.75
4 Singap… SGP   AP                 85         3            9                1.2 
5 Sweden  SWE   WE/EU              85         3            8                1.3 
6 Switze… CHE   WE/EU              85         3            7                1.1 
# ℹ 27 more variables: cpi_score_2019 <dbl>, rank_2019 <dbl>,
#   sources_2019 <dbl>, standard_error_2019 <dbl>, cpi_score_2018 <dbl>,
#   rank_2018 <dbl>, sources_2018 <dbl>, standard_error_2018 <dbl>,
#   cpi_score_2017 <dbl>, rank_2017 <dbl>, sources_2017 <dbl>,
#   standard_error_2017 <dbl>, cpi_score_2016 <dbl>, sources_2016 <dbl>,
#   standard_error_2016 <dbl>, cpi_score_2015 <dbl>, sources_2015 <dbl>,
#   standard_error_2015 <dbl>, cpi_score_2014 <dbl>, sources_2014 <dbl>, …
dim(globalcorruptiondataset)
[1] 180  34
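
The two renaming steps can be checked on a small character vector before applying them to a whole dataset. This is a minimal sketch; raw_names holds three of the original column names as an example.

```r
# Three of the original column names, used as a small test case
raw_names <- c("CPI score 2020", "Rank 2020", "Standard error 2020")

# Lowercase first, then replace every space with an underscore
clean_names <- gsub(" ", "_", tolower(raw_names))

print(clean_names)
```

Because gsub() replaces every match, names with several spaces such as "Standard error 2020" are handled in one pass.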

Rename the region codes to full names

My fifth step was to rename the region codes to full names.

globalcorruptiondataset$region[globalcorruptiondataset$region == "WE/EU"] <- "Western Europe"
globalcorruptiondataset$region[globalcorruptiondataset$region == "AP"] <- "Asia Pacific"
globalcorruptiondataset$region[globalcorruptiondataset$region == "MENA"] <- "Middle East and North Africa"
globalcorruptiondataset$region[globalcorruptiondataset$region == "ECA"] <- "Europe and Central Asia"
globalcorruptiondataset$region[globalcorruptiondataset$region == "SSA"] <- "Sub-Saharan Africa"
globalcorruptiondataset$region[globalcorruptiondataset$region == "AME"] <- "Americas"
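
The same recoding could also be written with a named lookup vector, which keeps all code-to-name pairs in one place. This is only a sketch of an alternative; region_lookup and codes are illustrative names, not objects from the analysis.

```r
# Named vector: names are the region codes, values are the full names
region_lookup <- c(
  "WE/EU" = "Western Europe",
  "AP"    = "Asia Pacific",
  "MENA"  = "Middle East and North Africa",
  "ECA"   = "Europe and Central Asia",
  "SSA"   = "Sub-Saharan Africa",
  "AME"   = "Americas"
)

# Example codes as they might appear in the region column
codes <- c("WE/EU", "AP", "SSA", "AME")

# Indexing by name translates every code in one step
full_names <- unname(region_lookup[codes])

print(full_names)
```

A code that is missing from the lookup would come back as NA, which makes unexpected region codes easy to spot.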

Inclusion / exclusion criteria

My sixth step was to apply some inclusion / exclusion criteria to select the data for my regression model. I wanted to select Western Europe and Sub-Saharan Africa as regions, as well as the cpi_score columns from 2012 to 2020. I also wanted to remove NAs by filtering.

#globalcorruptionone <- globalcorruptiondataset |>
#  filter(region %in% c("Western Europe", "Sub-Saharan Africa"))
# Column names are unquoted so that is.na() tests the column values;
# a quoted name is just a string, which is never NA, so no rows would be dropped.
global <- globalcorruptiondataset |> 
 filter(!is.na(cpi_score_2020) & !is.na(cpi_score_2019) & !is.na(cpi_score_2018) & !is.na(cpi_score_2017) & !is.na(cpi_score_2016) & !is.na(cpi_score_2015) & !is.na(cpi_score_2014) & !is.na(cpi_score_2013) & !is.na(cpi_score_2012))
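
A pitfall worth keeping in mind when removing missing values: is.na() applied to a quoted column name tests the string itself, which is never missing. The sketch below shows this on a hypothetical three-row data frame (toy is not part of the dataset) and uses base R's complete.cases() as one way to keep only the rows with no missing scores.

```r
# Hypothetical toy data: only country A has both scores
toy <- data.frame(
  country        = c("A", "B", "C"),
  cpi_score_2019 = c(50, NA, 70),
  cpi_score_2020 = c(52, 61, NA)
)

# A quoted name is a string literal, so this is FALSE regardless of the data
quoted_check <- is.na("cpi_score_2020")

# Select the score columns by name pattern and keep the complete rows
score_cols   <- grep("^cpi_score_", names(toy), value = TRUE)
toy_complete <- toy[complete.cases(toy[, score_cols]), ]

print(quoted_check)
print(toy_complete)
```

Comparing the row counts before and after the filter is a quick way to confirm that rows were actually removed.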

Regression model

In this step, I want to look at whether the CPI from previous years helps predict the CPI in 2020. I do this by using the command lm(y ~ x), which fits a model that predicts y from the predictor variable x. I included the earlier CPI scores (2012 to 2019, except 2014, which is not in the formula below) together with region, and fit a multiple linear regression.

The column Pr(>|t|) gives the p-value, which describes whether the predictor is useful to the model. The more asterisks, the smaller the p-value and the stronger the evidence that the variable contributes to the model. The adjusted R-squared shows me what proportion of the variation is explained by the model, adjusted for the number of predictors.

The correlation is a metric on a scale from -1 to 1, where values below 0 show a negative correlation, 0 shows no correlation, and values above 0 show a positive correlation. The closer the value is to -1 or 1, the stronger the correlation.

Once I find the predictor with the highest correlation, I will create a visualization of it.
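
The p-values and the adjusted R-squared can also be pulled out of a model summary programmatically rather than read off the printed table. This is a sketch using R's built-in mtcars data as a stand-in; the model mpg ~ wt + hp is purely illustrative and unrelated to the CPI analysis.

```r
# Stand-in model on built-in data
fit_demo    <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit_demo)

# The coefficient table is a matrix; the p-values sit in the "Pr(>|t|)" column
coef_table <- coef(fit_summary)
p_values   <- coef_table[, "Pr(>|t|)"]

# Terms with p < 0.05 are the ones that get at least one asterisk
significant <- names(p_values)[p_values < 0.05]

# Adjusted R-squared penalizes the plain R-squared for each extra predictor
adj_r2 <- fit_summary$adj.r.squared

print(round(p_values, 4))
print(significant)
print(adj_r2)
```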
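
As a small illustration of the correlation scale, the sketch below computes a positive and a negative correlation on hypothetical toy vectors (x, y_pos and y_neg are made up for the example). It also checks a standard fact: with a single predictor, the squared correlation equals the R-squared of the regression.

```r
# Hypothetical toy values
x     <- c(10, 20, 30, 40, 50)
y_pos <- c(12, 19, 33, 41, 48)  # rises with x
y_neg <- rev(y_pos)             # falls as x rises

r_pos <- cor(x, y_pos)
r_neg <- cor(x, y_neg)

# With one predictor, R-squared of the regression is the squared correlation
r2_simple <- summary(lm(y_pos ~ x))$r.squared

print(c(r_pos = r_pos, r_neg = r_neg))
print(isTRUE(all.equal(r_pos^2, r2_simple)))
```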

cor(global$cpi_score_2019, global$cpi_score_2020)
[1] 0.99559
fit1 <- lm(cpi_score_2020~cpi_score_2019+cpi_score_2018+cpi_score_2017+cpi_score_2016+cpi_score_2015+cpi_score_2013+cpi_score_2012+region,data = global)
summary(fit1)

Call:
lm(formula = cpi_score_2020 ~ cpi_score_2019 + cpi_score_2018 + 
    cpi_score_2017 + cpi_score_2016 + cpi_score_2015 + cpi_score_2013 + 
    cpi_score_2012 + region, data = global)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2316 -0.8081  0.1177  0.7662  4.8123 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        -0.032408   0.430078  -0.075  0.94003    
cpi_score_2019                      1.099116   0.059825  18.372  < 2e-16 ***
cpi_score_2018                     -0.093498   0.081324  -1.150  0.25206    
cpi_score_2017                      0.057108   0.081088   0.704  0.48233    
cpi_score_2016                     -0.124048   0.069728  -1.779  0.07722 .  
cpi_score_2015                      0.048585   0.059395   0.818  0.41463    
cpi_score_2013                      0.031481   0.065558   0.480  0.63177    
cpi_score_2012                     -0.019414   0.051693  -0.376  0.70777    
regionAsia Pacific                  0.079431   0.408301   0.195  0.84601    
regionEurope and Central Asia       1.250786   0.444322   2.815  0.00552 ** 
regionMiddle East and North Africa -0.008905   0.469914  -0.019  0.98491    
regionSub-Saharan Africa            0.126755   0.365382   0.347  0.72914    
regionWestern Europe               -0.189454   0.438926  -0.432  0.66662    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.44 on 153 degrees of freedom
  (14 observations deleted due to missingness)
Multiple R-squared:  0.9947,    Adjusted R-squared:  0.9943 
F-statistic:  2386 on 12 and 153 DF,  p-value: < 2.2e-16
plot(fit1)

P value

The p-value for the 2019 CPI score (< 2e-16) has three asterisks, which means that it contributes very strongly to my model. Europe and Central Asia also has a significant p-value (0.00552), marked with two asterisks.
The correlation between cpi_score_2019 and cpi_score_2020 was 0.996, which is a very strong positive correlation.

The Adjusted R Squared

The multiple R-squared (0.9947) and the adjusted R-squared (0.9943) both round to 99%, which means that about 99% of the variation in the 2020 CPI scores is explained by the model.

I did not run a backward elimination because the adjusted R-squared was already high at 99%. Most of the earlier-year coefficients are not significant, though, so a smaller model might fit almost as well.
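
For reference, a backward elimination could be sketched as follows. This example uses R's built-in mtcars data as a stand-in, and the predictors are chosen only for illustration. Note that step() drops terms based on AIC rather than p-values, so it is one possible approach, not the only one.

```r
# Start from a full model on the stand-in data
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Remove terms one at a time while doing so lowers the AIC; trace = 0 hides the log
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Compare what was kept and how the adjusted R-squared changed
print(formula(reduced_model))
print(c(
  full    = summary(full_model)$adj.r.squared,
  reduced = summary(reduced_model)$adj.r.squared
))
```

If the adjusted R-squared barely changes after terms are dropped, the smaller model explains the data about as well with fewer predictors.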

Equation

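Keeping only the intercept and the 2019 term, the equation y = ax + b is cpi_score_2020 = 1.099(cpi_score_2019) - 0.032. The full model also includes the other years and the region terms listed in the coefficient table.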
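
To show how an equation of the form y = ax + b is read off a fitted model and used for a prediction, here is a minimal sketch. The scores in toy are hypothetical values invented for the example, not rows from the dataset, and the model has a single predictor.

```r
# Hypothetical scores for five countries
toy <- data.frame(
  cpi_score_2019 = c(20, 35, 50, 65, 80),
  cpi_score_2020 = c(21, 34, 52, 64, 81)
)

fit_toy <- lm(cpi_score_2020 ~ cpi_score_2019, data = toy)

# b is the intercept, a is the slope on the 2019 score
b <- coef(fit_toy)[["(Intercept)"]]
a <- coef(fit_toy)[["cpi_score_2019"]]

# Predict the 2020 score for a 2019 score of 60, by hand and with predict()
manual_prediction <- a * 60 + b
model_prediction  <- predict(fit_toy, newdata = data.frame(cpi_score_2019 = 60))

print(c(a = a, b = b))
print(c(manual = manual_prediction, predict = unname(model_prediction)))
```

Both routes give the same number because predict() applies the same intercept and slope.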

Create a visualization with the one that has the highest correlation

# Plot the data frame itself rather than the fitted model object,
# and let geom_smooth() draw the linear regression line
p1 <- ggplot(global,
               aes(x = cpi_score_2019, y = cpi_score_2020)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue2") +
  labs(
    title   = "Does the CPI score in 2019 predict the CPI score in 2020?",
    x       = "CPI Score 2019",
    y       = "CPI Score 2020",
    caption = "Source: Transparency International") +
    theme_minimal(base_size = 12) 
p1

  # Both axes are continuous, so a scatter plot colored by region is used
  # instead of a box plot; the color names must match the region labels exactly
  ggplot(global, aes(x = cpi_score_2019, y = cpi_score_2020, color = region)) +
  geom_point() +
  labs(
    x = "CPI Score 2019",
    y = "CPI Score 2020",
    title = "Relationship between CPI score 2019 and CPI score 2020 by Region",
    color = "Region",
    caption = "Source: Transparency International") + 
    scale_color_manual(values = c("Asia Pacific" = "yellow","Western Europe" = "purple",
                               "Sub-Saharan Africa" = "orange","Europe and Central Asia" = "red",
                               "Middle East and North Africa" = "blue", "Americas" = "pink")) +
  theme_minimal()

Conclusion essay

How I cleaned the dataset up

To clean my dataset for this project, I set all the variable names to lowercase so that I would not have to keep track of capitalization. I also used the gsub command to replace the spaces in variable names with underscores, specifically in the CPI score and rank columns. I ran the head command again to look at the changes made. Other cleaning steps included renaming the region codes to full names, applying some inclusion / exclusion criteria to select the columns for my regression model, and removing NAs.

What the visualization represents, any interesting patterns or surprises that arise within the visualization.

The plot did not surprise me, as I knew that the 2019 and 2020 CPI scores had a very high correlation. Looking at it, I can see that the correlation is indeed very strong: the line rises steadily from left to right, showing that as one variable increases, the other also increases in a predictable manner.

Anything that you might have shown that you could not get to work or that you wished you could have included

I realized that my dataset did not really allow for a very interesting analysis. I wish I had more diverse raw data, such as GDP per capita, political systems, income levels, educational attainment, and other factors, to learn more about how they relate to the CPI score. This project has also made me realize the importance of starting early so that I have more time to work on it.