This guide shows how to make a graph similar to the image below. I will walk you through creating the graph
with ggplot2 and then show how we can add more functionality for the user by making it interactive. I will
also highlight a few changes we need to make to our code to make this easier!

If you would like the data set for this, feel free to send me an email:



Reference Graph

This data set is from my Data Science and Machine Learning with R course. The original assignment was to make a scatter plot similar to the image above.


Loading in the Dataset

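# stringsAsFactors = FALSE keeps Region as character, so the typo fix below also works on R < 4.0
my.data <- read.csv("Economist_Assignment_Data.csv", stringsAsFactors = FALSE)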
head(my.data)
##   X     Country HDI.Rank   HDI CPI            Region
## 1 1 Afghanistan      172 0.398 1.5      Asia Pacific
## 2 2     Albania       70 0.739 3.1 East EU Cemt Asia
## 3 3     Algeria       96 0.698 2.9              MENA
## 4 4      Angola      148 0.486 2.0               SSA
## 5 5   Argentina       45 0.797 3.0          Americas
## 6 6     Armenia       86 0.716 2.6 East EU Cemt Asia
# Removing the first column since it's just an integer
my.data <- my.data[, -1]

# Fixing the typo
my.data$Region[my.data$Region == "East EU Cemt Asia"] <- "East EU Cent Asia"
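If your copy of the data ends up with Region stored as a factor (the default for read.csv() before R 4.0), assigning a new string to its elements produces NA instead of fixing the typo. Here is a minimal sketch of a factor-safe version of both cleaning steps. It runs on a toy data frame (the name toy is my own) rebuilt from the six rows printed by head() above, so treat it as an illustration rather than the real data set.

```r
# Toy copy of the six rows printed by head() above, with Region stored as a factor
toy <- data.frame(
  X = 1:6,
  Country = c("Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "Armenia"),
  HDI.Rank = c(172, 70, 96, 148, 45, 86),
  HDI = c(0.398, 0.739, 0.698, 0.486, 0.797, 0.716),
  CPI = c(1.5, 3.1, 2.9, 2.0, 3.0, 2.6),
  Region = factor(c("Asia Pacific", "East EU Cemt Asia", "MENA",
                    "SSA", "Americas", "East EU Cemt Asia")),
  stringsAsFactors = FALSE
)

# Drop the row-number column by name rather than by position
toy$X <- NULL

# Rename the factor level itself; assigning a new string to the elements would give NA
levels(toy$Region)[levels(toy$Region) == "East EU Cemt Asia"] <- "East EU Cent Asia"

# Check the result
names(toy)
table(toy$Region)
```

Dropping the column by name is a small design choice: it keeps working even if the column order changes.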


Walkthrough to create the original graph

Load ggplot2

library(ggplot2)

Create the base scatter plot

sp <- ggplot(my.data, aes(x = CPI, y = HDI))
sp + geom_point(aes(color = Region))

This looks nice by itself! Let’s change those dots to open circles and add a trend line.

sp + geom_point(aes(color = Region), shape = 21, size = 4) + # shape = 21 is an outlined circle; with no fill set it appears open
  
  # stat_smooth() will add our trend line
  stat_smooth(method = "lm", formula = y ~ log(x), se = FALSE, color = "red") 
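To see what that trend line represents, here is a rough sketch of the same kind of model fitted directly with lm(): HDI regressed on the logarithm of CPI. It uses only the six rows printed by head() earlier (the names toy, fit, and grid are my own), so the coefficients will not match the full data set.

```r
# Six rows taken from the head() output earlier in this guide
toy <- data.frame(
  CPI = c(1.5, 3.1, 2.9, 2.0, 3.0, 2.6),
  HDI = c(0.398, 0.739, 0.698, 0.486, 0.797, 0.716)
)

# Same model form as formula = y ~ log(x) with method = "lm"
fit <- lm(HDI ~ log(CPI), data = toy)
coef(fit)

# Predicted HDI along a grid of CPI values, which is what the red line traces
grid <- data.frame(CPI = seq(1.5, 3.1, length.out = 5))
grid$HDI_hat <- predict(fit, newdata = grid)
grid
```

Because the model is linear in log(CPI), the fitted curve rises quickly at low CPI values and flattens out as CPI grows.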

Now we need to add the country names to the data points as they did in the original graph. To avoid having every data point labeled, we need to perform an extra step!

# These are all the countries we want labeled
pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan",
                   "Afghanistan", "Congo", "Greece", "Argentina", "Brazil",
                   "India", "Italy", "China", "South Africa", "Spain",
                   "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France",
                   "United States", "Germany", "Britain", "Barbados", "Norway", "Japan",
                   "New Zealand", "Singapore")

sp + geom_point(aes(color = Region), shape = 21, size = 4) +
  stat_smooth(method = "lm", formula = y ~ log(x), se = FALSE, color = "red") +
  
  # Label only the selected countries; check_overlap = TRUE drops any label that would overlap an earlier one
  geom_text(aes(label = Country), color = "gray20", 
            data = subset(my.data, Country %in% pointsToLabel), check_overlap = TRUE)
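One thing to watch with this approach: a name in the vector that matches no row is dropped silently, so a misspelling shows up only as a missing label. Here is a small sketch of how the subset works and how setdiff() can flag names that match nothing. The toy data frame comes from the head() output earlier, and the wanted vector is hypothetical, with one deliberately misspelled entry.

```r
# Six countries from the head() output earlier in this guide
toy <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "Armenia"),
  CPI = c(1.5, 3.1, 2.9, 2.0, 3.0, 2.6),
  HDI = c(0.398, 0.739, 0.698, 0.486, 0.797, 0.716),
  stringsAsFactors = FALSE
)

# Hypothetical label list; the last entry is misspelled on purpose
wanted <- c("Afghanistan", "Argentina", "Argentinaa")

# Only rows whose Country appears in the list are kept for labeling
labelled <- subset(toy, Country %in% wanted)
labelled$Country

# Names that match no row in the data, which would never be labeled
setdiff(wanted, toy$Country)
```

Running the same setdiff() check against the real data is a quick way to confirm that every name in your own list will get a label.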

As you can probably see in the labeled plot, we have a bit of a scaling issue with the axes. Let’s fix that by setting the limits and breaks ourselves, and change our axis labels to something more meaningful as well.

sp + geom_point(aes(color = Region), shape = 21, size = 4) +
  stat_smooth(method = "lm", formula = y ~ log(x), se = FALSE, color = "red") +
  geom_text(aes(label = Country), color = "gray20", 
            data = subset(my.data, Country %in% pointsToLabel), check_overlap = TRUE) + 
  
  # Formatting our x-axis
  scale_x_continuous(name = "Corruption Perceptions Index, 2011 (10 = least corrupt)",
                     limits = c(1,10), breaks = 1:10) + 
  
  # Formatting our y-axis
  scale_y_continuous(name = "Human Development Index, 2011 (1 = best)", limits = c(0.2, 1)) + 
  labs(title = "Corruption and Human Development") +
  
  # Changing the theme to a white background
  theme_bw()
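A side note on the limits argument: setting limits on a scale turns any values outside the range into NA before the plot is drawn, so those points are dropped (with a warning) and are also ignored by the trend line. If you only want to zoom in without dropping data, coord_cartesian() is the usual alternative. Here is a small sketch of the difference. The toy data comes from the head() output earlier, plus one made-up row with a CPI of 0.5 that I added purely to fall outside the limits.

```r
library(ggplot2)

# Six rows from the head() output, plus one made-up row (CPI = 0.5) outside the limits
toy <- data.frame(
  CPI = c(0.5, 1.5, 3.1, 2.9, 2.0, 3.0, 2.6),
  HDI = c(0.300, 0.398, 0.739, 0.698, 0.486, 0.797, 0.716)
)

p <- ggplot(toy, aes(x = CPI, y = HDI)) + geom_point()

# Scale limits: the out-of-range value becomes NA and the point is not drawn
p_scale <- p + scale_x_continuous(limits = c(1, 10))

# Coordinate limits: the view is zoomed, but the data itself is left alone
p_zoom <- p + coord_cartesian(xlim = c(1, 10))

# Compare the x values each plot actually works with
layer_data(p_scale)$x
layer_data(p_zoom)$x
```

For the graph in this guide, the scale limits are fine as long as every CPI and HDI value falls inside them.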

The finished plot looks great on its own, but let’s see how plotly can make this even better!