Data Visualization Project

A correlation between corruption and development

The purpose of this project is to use data visualization to explore the relationship between corruption and human development across nations, based on the UN Human Development Report. The data comes from the article ‘Corrosive corruption’ published in The Economist.

1. Load and Check Data

# Load the libraries
library(ggplot2)
library(ggthemes)
library(data.table)

# Load the data
df <- fread('Economist_Assignment_Data.csv', drop=1)
str(df)
## Classes 'data.table' and 'data.frame':   173 obs. of  5 variables:
##  $ Country : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ HDI.Rank: int  172 70 96 148 45 86 2 19 91 53 ...
##  $ HDI     : num  0.398 0.739 0.698 0.486 0.797 0.716 0.929 0.885 0.7 0.771 ...
##  $ CPI     : num  1.5 3.1 2.9 2 3 2.6 8.8 7.8 2.4 7.3 ...
##  $ Region  : chr  "Asia Pacific" "East EU Cemt Asia" "MENA" "SSA" ...
##  - attr(*, ".internal.selfref")=<externalptr>
The dataset has 173 observations of 5 variables. The variables in the dataset are:
  1. Country
  2. HDI.Rank - Human Development Index Rank
  3. HDI - Human Development Index
  4. CPI - Corruption Perceptions Index
  5. Region
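
Beyond str(), a few quick checks can confirm that the data loaded cleanly. The sketch below is only an illustration: it rebuilds the first four rows shown in the str() output as a small data frame named toy (a hypothetical stand-in for df), and it assumes HDI lies between 0 and 1 and CPI between 0 and 10, as the axis labels later in this report suggest.

```r
# First four rows copied from the str() output above (the real data has 173 rows)
toy <- data.frame(
  Country  = c("Afghanistan", "Albania", "Algeria", "Angola"),
  HDI.Rank = c(172L, 70L, 96L, 148L),
  HDI      = c(0.398, 0.739, 0.698, 0.486),
  CPI      = c(1.5, 3.1, 2.9, 2.0),
  Region   = c("Asia Pacific", "East EU Cemt Asia", "MENA", "SSA"),
  stringsAsFactors = FALSE
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))

# Check that both indices fall inside their assumed ranges
hdi_ok <- all(toy$HDI >= 0 & toy$HDI <= 1)
cpi_ok <- all(toy$CPI >= 0 & toy$CPI <= 10)

# List any country names that appear more than once
dup_countries <- toy$Country[duplicated(toy$Country)]
```

Running the same three checks on df instead of toy would cover the full dataset.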

1.1 Scatter Plot of CPI and HDI

Let’s create a scatter plot object called pl, mapping x=CPI, y=HDI, and color=Region as aesthetics.

pl <- ggplot(df, aes(CPI, HDI, color=Region)) + geom_point()
pl

We can see a plot of HDI vs CPI. The points are colored by region.

1.2 Update Points on Chart

Let’s change the points to be larger empty circles.

pl <- ggplot(df, aes(CPI, HDI, color=Region)) + geom_point(shape=1, size=4)
pl

1.3 Add a Trendline

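Let’s add a single trendline. Setting aes(group=1) fits one line across all countries instead of one per region; by default geom_smooth() uses a loess smoother.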

pl2 <- pl + geom_smooth(aes(group=1))
pl2
## `geom_smooth()` using method = 'loess'

1.4 Edit to Smooth

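Let’s replace the default loess smoother with a linear model fitted on log(CPI), remove the confidence band (se=FALSE), and color the line red.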

pl2 <- pl + geom_smooth(aes(group=1), method='lm', formula=y~log(x), 
                        se=FALSE, color='red')
pl2
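
To see what the red line represents, the same model form can be fitted directly with lm(). This is a minimal sketch, not the fit drawn in the plot: it uses only the first four CPI and HDI values from the str() output (stored in a hypothetical data frame named toy), so the coefficients are illustrative.

```r
# First four CPI/HDI pairs from the str() output; illustrative only
toy <- data.frame(
  CPI = c(1.5, 3.1, 2.9, 2.0),
  HDI = c(0.398, 0.739, 0.698, 0.486)
)

# Same model form as geom_smooth(method = 'lm', formula = y ~ log(x))
fit <- lm(HDI ~ log(CPI), data = toy)

# Intercept and slope on the log(CPI) scale
coefs <- coef(fit)

# Predicted HDI at a chosen CPI value (5 is an arbitrary example)
pred <- predict(fit, newdata = data.frame(CPI = 5))
```

Under this model, each doubling of CPI changes the predicted HDI by the slope times log(2). Replacing toy with df would give the coefficients for the full dataset.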

1.5 Add Labels

Let’s add a label for every country to the plot.

pl3 <- pl2 + geom_text(aes(label = Country))
pl3

Observation: Labeling every country produces far too many overlapping labels. Let’s clean up the plot.

1.6 Add Specific Labels

Let’s add labels only for certain countries to make the plot easily understandable.

# Select specific countries
pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan",
                   "Afghanistan", "Congo", "Greece", "Argentina", "Brazil",
                   "India", "Italy", "China", "South Africa", "Spain",
                   "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France",
                   "United States", "Germany", "Britain", "Barbados", "Norway",
                   "Japan", "New Zealand", "Singapore", "Cuba")

# Add countries to plot
pl3 <- pl2 + geom_text(aes(label=Country), color='gray20',
                       data=subset(df, Country %in% pointsToLabel),
                       check_overlap = TRUE)
pl3
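
Because %in% silently ignores names that do not occur in the Country column, a misspelled or differently worded name simply goes unlabeled. A quick setdiff() check can surface such names. The sketch below uses two short hypothetical vectors (countries standing in for df$Country, labels_wanted standing in for pointsToLabel), and "Atlantis" is a deliberately wrong name.

```r
# Stand-in for df$Country (first four countries from the str() output)
countries <- c("Afghanistan", "Albania", "Algeria", "Angola")

# Stand-in for pointsToLabel; "Atlantis" is intentionally not a real entry
labels_wanted <- c("Afghanistan", "Angola", "Atlantis")

# Names requested for labeling that have no match in the data
unmatched <- setdiff(labels_wanted, countries)
unmatched
```

With the real objects, setdiff(pointsToLabel, df$Country) should return character(0) if every requested label matches a country in the data.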

1.7 Add Theme

Let’s apply a black-and-white theme to the plot.

pl4 <- pl3 + theme_bw()
pl4

1.8 Add Limits and Title

Let’s set the x- and y-axis limits, the axis labels, and the title of the plot.

pl5 <- pl4 + scale_x_continuous(limits = c(.9, 10.5), breaks=1:10) +
      scale_y_continuous(limits = c(.2, 1.0)) + 
      labs(x="Corruption Perceptions Index, 2011 (10=least corrupt)",
           y="Human Development Index, 2011 (1=Best)",
           title='Corruption and Human Development')
pl5

Observation: Comparing the Corruption Perceptions Index (CPI) with the Human Development Index (HDI, a measure combining health, wealth and education), we can see an interesting connection:
  1. When the CPI is between approximately 2.0 and 4.0, there appears to be little relationship with the HDI.
  2. As the CPI rises beyond 4.0 (that is, as perceived corruption falls), a stronger connection can be seen.
  3. Outliers include small but well-run poorer countries such as Bhutan and Cape Verde, while Greece and Italy stand out among the richer countries.
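
One way to put numbers on the first two observations is to compute the correlation between CPI and HDI separately within each CPI band. The helper below, cor_in_band(), is a hypothetical function written for this sketch, and the toy data frame contains made-up values chosen only to exercise it; neither reflects results from the actual dataset.

```r
# Correlation between CPI and HDI for rows with lo <= CPI < hi
cor_in_band <- function(d, lo, hi) {
  band <- d[d$CPI >= lo & d$CPI < hi, ]
  cor(band$CPI, band$HDI)
}

# Made-up values for demonstration only (not taken from the dataset)
toy <- data.frame(
  CPI = c(2, 2.5, 3, 3.5, 5, 6, 7, 8),
  HDI = c(0.50, 0.70, 0.40, 0.60, 0.70, 0.78, 0.85, 0.92)
)

# Correlation in the lower band and in the upper band
low_band  <- cor_in_band(toy, 2, 4)
high_band <- cor_in_band(toy, 4, Inf)
```

Calling cor_in_band(df, 2, 4) and cor_in_band(df, 4, Inf) on the loaded data would show whether the visual impression from the plot holds numerically.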