The dataset I chose covers global corruption. It has 34 variables, including the Corruption Perceptions Index (CPI) score, standard error, rank, and sources, covering 2012 to 2020.
In the analysis below, my main question is whether CPI scores from previous years help predict the 2020 CPI score. The CPI is the most widely used global corruption ranking and measures how corrupt a country’s public sector is perceived to be.
The dataset comes from Transparency International, whose mission is to stop corruption and promote transparency, accountability and integrity at all levels and across all sectors of society. https://www.transparency.org/en/about
Load the library packages
The first step in this analysis was to load all the packages I needed. I created a chunk, called library() on each package, and ran the chunk. Packages are the fundamental units of reproducible R code; they include reusable functions, documentation that describes how to use them, and sample data.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(zoo)
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
library(ggfortify)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
The third step was to load globalcorruptiondataset from my working directory using setwd() and <- read_csv(). I then used head() to see the first few rows of the dataset and its columns. The dataset was added to the global environment, where I could inspect it.
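The loading chunk itself does not appear in the rendered output. A minimal sketch of what this step might look like is below. It assumes readr is installed, and it writes a tiny stand-in CSV to a temporary file (made-up rows, hypothetical object names) so the example runs on its own; the real file name and path are not known from this report.

```r
library(readr)

# Stand-in for the real file: a tiny CSV with a few of the dataset's columns.
# The rows are made up for illustration only.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,ISO3,Region,CPI score 2020,CPI score 2019",
  "Alpha,AAA,WE/EU,80,79",
  "Beta,BBB,SSA,30,31"
), toy_path)

# read_csv() returns a tibble; show_col_types = FALSE quiets the column message
toy <- read_csv(toy_path, show_col_types = FALSE)

# head() prints the first rows so the columns can be inspected
head(toy)
```

Passing show_col_types = FALSE is optional; without it, read_csv() prints a column specification like the one shown below.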
Rows: 180 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, ISO3, Region
dbl (31): CPI score 2020, Rank 2020, Sources 2020, Standard error 2020, CPI ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
My fourth step was to clean the data by converting all variable names to lowercase, so I would not have to keep track of capitalization. I also made sure the variable names contain no spaces. I ran head() again to check the changes.
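The name cleaning could be sketched in base R as follows. The toy data frame and its values are made up, and replacing spaces with underscores is an assumption based on the cleaned names that appear later in the report (for example cpi_score_2020).

```r
# Toy data frame with names in the same style as the raw file (made-up values)
toy <- data.frame(
  "Country" = c("Alpha", "Beta"),
  "CPI score 2020" = c(80, 30),
  "Rank 2020" = c(5, 150),
  check.names = FALSE
)

# Lowercase every column name, then replace each space with an underscore
names(toy) <- gsub(" ", "_", tolower(names(toy)), fixed = TRUE)

names(toy)
```

Doing both steps on names() in one assignment keeps the data untouched and only changes the labels.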
My fifth step was to rename the region codes to full names.
globalcorruptiondataset$region[globalcorruptiondataset$region == "WE/EU"] <- "Western Europe"
globalcorruptiondataset$region[globalcorruptiondataset$region == "AP"] <- "Asia Pacific"
globalcorruptiondataset$region[globalcorruptiondataset$region == "MENA"] <- "Middle East and North Africa"
globalcorruptiondataset$region[globalcorruptiondataset$region == "ECA"] <- "Europe and Central Asia"
globalcorruptiondataset$region[globalcorruptiondataset$region == "SSA"] <- "Sub-Saharan Africa"
globalcorruptiondataset$region[globalcorruptiondataset$region == "AME"] <- "Americas"
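An alternative to six separate assignments is a named lookup vector. This is only a sketch of that idea, not the method used in this report; the vector of codes below is a toy stand-in for the region column.

```r
# Lookup table: region code -> full name (same mapping as the assignments above)
region_names <- c(
  "WE/EU" = "Western Europe",
  "AP"    = "Asia Pacific",
  "MENA"  = "Middle East and North Africa",
  "ECA"   = "Europe and Central Asia",
  "SSA"   = "Sub-Saharan Africa",
  "AME"   = "Americas"
)

# Toy column of codes standing in for the region column
region <- c("SSA", "WE/EU", "AME", "SSA")

# Index the lookup by code; unname() drops the codes that are carried along as names
region_full <- unname(region_names[region])
region_full
```

With a lookup vector the mapping lives in one place, so a misspelled code or name only has to be fixed once.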
Inclusion / exclusion criteria
My sixth step was to apply inclusion / exclusion criteria to select the columns for my regression model. I wanted to keep Western Europe and Sub-Saharan Africa as regions, along with the cpi_score columns from 2012 to 2020. I also removed NAs by filtering.
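The filtering chunk is not shown in the rendered output, so the following is a sketch of the criteria as described, assuming dplyr is installed. The toy data and the name global_toy are hypothetical, and the exact filter used in the report may have differed.

```r
library(dplyr)

# Toy data in the cleaned format (made-up values)
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  region = c("Western Europe", "Sub-Saharan Africa", "Americas", "Western Europe"),
  cpi_score_2020 = c(80, 30, 45, NA),
  cpi_score_2019 = c(79, 31, 44, 70),
  rank_2020 = c(5, 150, 90, 20)
)

global_toy <- toy %>%
  # keep only the two regions of interest
  filter(region %in% c("Western Europe", "Sub-Saharan Africa")) %>%
  # keep the identifying columns and every CPI score column
  select(country, region, starts_with("cpi_score")) %>%
  # drop rows with a missing value in any remaining column
  filter(if_all(everything(), ~ !is.na(.x)))

global_toy
```

In this toy example, country C is dropped by the region filter and country D is dropped because its 2020 score is missing.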
In this step, I look at whether CPI scores from previous years help predict the 2020 CPI score. I do this with lm(y ~ x), which fits a model that predicts y from the predictor variable x. I included the earlier CPI scores and region as predictors and ran a multiple linear regression.
The Pr(>|t|) column gives the p-value for each predictor, which indicates whether that predictor is useful to the model. The more asterisks, the smaller the p-value and the stronger the evidence that the variable contributes to the model. The adjusted R-squared shows how much of the variation is explained by the model, adjusted for the number of predictors.
Correlation is measured on a scale from -1 to 1: values from -1 to 0 show a negative correlation, 0 shows no correlation, and values from 0 to 1 show a positive correlation, with values nearer -1 or 1 indicating a stronger relationship.
Once I find the predictor with the highest correlation, I will create a visualization of it.
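To make the pieces of the summary concrete, here is a small base R sketch on simulated data. The numbers are randomly generated stand-ins, not the CPI data, and the names toy, fit_toy and fit_summary are hypothetical.

```r
set.seed(1)

# Simulated stand-in for the CPI columns: 2019 tracks 2018, and 2020 tracks 2019
n <- 100
cpi_score_2018 <- runif(n, 10, 90)
cpi_score_2019 <- cpi_score_2018 + rnorm(n, sd = 2)
cpi_score_2020 <- cpi_score_2019 + rnorm(n, sd = 1.5)
toy <- data.frame(cpi_score_2018, cpi_score_2019, cpi_score_2020)

# Multiple linear regression: response on the left of ~, predictors on the right
fit_toy <- lm(cpi_score_2020 ~ cpi_score_2019 + cpi_score_2018, data = toy)
fit_summary <- summary(fit_toy)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_summary)
coef_table[, "Pr(>|t|)"]

# Adjusted R-squared: share of variance explained, penalized for extra predictors
fit_summary$adj.r.squared
```

Pulling the p-values and the adjusted R-squared out of the summary object this way makes it possible to reuse them in later chunks instead of reading them off the printed table.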
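One way to find the predictor with the highest correlation is to compute the correlation of every column with the 2020 score using cor(). The sketch below uses simulated data, and the column named unrelated is invented purely for contrast.

```r
set.seed(2)

# Simulated stand-in data: one column that tracks the response, one that does not
n <- 50
cpi_score_2019 <- runif(n, 10, 90)
toy <- data.frame(
  cpi_score_2020 = cpi_score_2019 + rnorm(n, sd = 2),
  cpi_score_2019 = cpi_score_2019,
  unrelated = rnorm(n)
)

# Pearson correlation of every column with the 2020 score
cors <- cor(toy, use = "complete.obs")[, "cpi_score_2020"]

# Drop the response itself, then pick the predictor with the largest absolute correlation
cors <- cors[names(cors) != "cpi_score_2020"]
best <- names(which.max(abs(cors)))
best
```

Using the absolute value matters because a correlation near -1 is just as strong as one near 1.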
Call:
lm(formula = cpi_score_2020 ~ cpi_score_2019 + cpi_score_2018 +
cpi_score_2017 + cpi_score_2016 + cpi_score_2015 + cpi_score_2013 +
cpi_score_2012 + region, data = global)
Residuals:
Min 1Q Median 3Q Max
-5.2316 -0.8081 0.1177 0.7662 4.8123
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.032408 0.430078 -0.075 0.94003
cpi_score_2019 1.099116 0.059825 18.372 < 2e-16 ***
cpi_score_2018 -0.093498 0.081324 -1.150 0.25206
cpi_score_2017 0.057108 0.081088 0.704 0.48233
cpi_score_2016 -0.124048 0.069728 -1.779 0.07722 .
cpi_score_2015 0.048585 0.059395 0.818 0.41463
cpi_score_2013 0.031481 0.065558 0.480 0.63177
cpi_score_2012 -0.019414 0.051693 -0.376 0.70777
regionAsia Pacific 0.079431 0.408301 0.195 0.84601
regionEurope and Central Asia 1.250786 0.444322 2.815 0.00552 **
regionMiddle East and North Africa -0.008905 0.469914 -0.019 0.98491
regionSub-Saharan Africa 0.126755 0.365382 0.347 0.72914
regionWestern Europe -0.189454 0.438926 -0.432 0.66662
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.44 on 153 degrees of freedom
(14 observations deleted due to missingness)
Multiple R-squared: 0.9947, Adjusted R-squared: 0.9943
F-statistic: 2386 on 12 and 153 DF, p-value: < 2.2e-16
plot(fit1)
P value
The p-value for the 2019 CPI score had 3 asterisks, which means it was highly significant in my model. Europe and Central Asia also had a significant p-value, marked with 2 asterisks.
The correlation between cpi_score_2019 and cpi_score_2020 was 0.99, a strong positive correlation.
The Adjusted R Squared
Both the multiple R-squared and the adjusted R-squared were 99%, which shows that 99% of the variation in the 2020 CPI scores can be explained by the model and that the relationship is very strong.
There was no need for backward elimination, as the adjusted R-squared was already high at 99%.
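For reference, backward elimination could be sketched with base R's step(), which removes terms one at a time based on AIC rather than on p-values. This was not run in this report; the data below are simulated and the predictors noise1 and noise2 are invented for illustration.

```r
set.seed(3)

# Simulated data: y depends on x1 only; noise1 and noise2 are irrelevant
n <- 120
x1 <- rnorm(n)
noise1 <- rnorm(n)
noise2 <- rnorm(n)
y <- 2 + 3 * x1 + rnorm(n, sd = 0.5)
toy <- data.frame(y, x1, noise1, noise2)

full_fit <- lm(y ~ x1 + noise1 + noise2, data = toy)

# step() drops one term at a time for as long as doing so lowers the AIC
reduced_fit <- step(full_fit, direction = "backward", trace = 0)

formula(reduced_fit)
```

A high R-squared on its own does not show that every predictor is needed, so a check like this can still be useful when most coefficients are not significant.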
Equation
Using the 2019 coefficient and the intercept from the model above, the equation in the form y = ax+b is cpi_score_2020 = 1.099(cpi_score_2019) - 0.032.
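Coefficients from a multiple regression are estimated with the other predictors held fixed, so a one-variable equation is cleanest when it comes from a model with only that predictor. The sketch below shows that idea on simulated data; the values and the name simple_fit are hypothetical, and the resulting coefficients are not the ones from this report.

```r
set.seed(4)

# Simulated stand-in data with a slope close to 1
n <- 80
cpi_score_2019 <- runif(n, 10, 90)
cpi_score_2020 <- 1.0 * cpi_score_2019 + rnorm(n, sd = 1.5)
toy <- data.frame(cpi_score_2019, cpi_score_2020)

simple_fit <- lm(cpi_score_2020 ~ cpi_score_2019, data = toy)

# b is the intercept and a is the slope in y = a * x + b
b <- coef(simple_fit)[["(Intercept)"]]
a <- coef(simple_fit)[["cpi_score_2019"]]

# Predicted 2020 score for a 2019 score of 50, by hand and via predict()
by_hand <- a * 50 + b
by_predict <- predict(simple_fit, newdata = data.frame(cpi_score_2019 = 50))
```

Both approaches give the same prediction, which is a quick way to confirm that the equation was written down correctly.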
Create a visualization with the one that has the highest correlation
# Plot the data itself rather than the lm object, and draw the fitted line with
# geom_smooth(), which accepts method and formula (geom_line() ignores them)
p1 <- ggplot(global, aes(x = cpi_score_2019, y = cpi_score_2020)) +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue2") +
  labs(title = "Does CPI score in 2019 predict CPI score in 2020?",
       x = "CPI Score 2019",
       y = "CPI Score 2020",
       caption = "Source: Transparency International") +
  theme_minimal(base_size = 12)
p1
ggplot(global, aes(x = cpi_score_2019, y = cpi_score_2020, fill = region)) +
  geom_boxplot() +
  labs(x = "cpi_score_2019",
       y = "cpi_score_2020",
       title = "Relationship between CPI score 2019 and CPI score 2020 by Region",
       fill = "region",
       caption = "Source: Transparency International") +
  # the keys must match the region names exactly, including the hyphen in Sub-Saharan Africa
  scale_fill_manual(values = c("Asia Pacific" = "yellow",
                               "Western Europe" = "purple",
                               "Sub-Saharan Africa" = "orange",
                               "Europe and Central Asia" = "red",
                               "Middle East and North Africa" = "blue",
                               "Americas" = "pink")) +
  theme_minimal()
Warning: Orientation is not uniquely specified when both the x and y aesthetics are
continuous. Picking default orientation 'x'.
Conclusion essay
How I cleaned the dataset up
To clean my dataset for this project, I converted all the variable names to lowercase so I would not have to keep track of capitalization. I also used gsub() to remove spaces from the variable names, specifically the CPI scores and the ranks. I ran head() again to check the changes. Other cleaning steps included renaming the region codes to full names, applying inclusion / exclusion criteria to select the columns for my regression model, and removing NAs.
What the visualization represents, any interesting patterns or surprises that arise within the visualization.
The plot did not surprise me, as I knew that the 2019 and 2020 CPI scores were very highly correlated. The line rises steadily from left to right, showing that as one variable increases, the other also increases in a predictable manner, which confirms the strong correlation.
Anything that you might have shown that you could not get to work or that you wished you could have included
I realized that my dataset did not allow for a very interesting analysis. I wish I had more diverse raw data, such as GDP per capita, political systems, income levels, and educational attainment, to learn more about the relationship between the CPI score and other factors. This project also made me realize the importance of starting early to have more time to work on it.