We will try to replicate one of The Economist plot, which tells us about the decrease of South Korean women fertility rates. You can look the articles here here .
Well, R is just a tool, just like other data visualization tools like Tableau or even Excel. But what makes R powerful is that it is FREE and it is REPRODUCIBLE. We can build a simple graph with Excel, but you can’t reproduce it in an efficient way. You have to go through the same steps, same clicks, everytime you want to reproduce a graph. In R, you just have to build the plot once, and you can recreate it as many as you want.
Another think to consider is that R has this great package, named ggplot2
, which has many varieties of plot and supporting elements that you can use create a beautiful and charming visualization. To show you the power of R, I will recreate one of The Economist plot, which is acknowledged due to the simple yet elegant plot.
Before we start our journey, first we must load the required libraries.
library(tidyverse)
library(scales)
library(ggrepel)
library(ggthemes)
library(grid)
options(scipen = 100)
The options()
function means to set the scientific notation.
The data is acquired from World Bank
fertility rate data: https://data.worldbank.org/indicator/SP.DYN.TFRT.IN
population data: https://data.worldbank.org/indicator/SP.POP.TOTL?end=2018&start=1996
gdp per capita data: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
data_tfr <- read_csv("data/world_bank_fertility.csv",skip = 4)
## Warning: Missing column names filled in: 'X64' [64]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## `2018` = col_logical(),
## X64 = col_logical()
## )
## See spec(...) for full column specifications.
data_tfr
data_pop <- read_csv("data/world_bank_pop.csv",skip = 4)
## Warning: Missing column names filled in: 'X64' [64]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## X64 = col_logical()
## )
## See spec(...) for full column specifications.
data_gdp <- read_csv("data/world_bank_gdp.csv",skip = 4)
## Warning: Missing column names filled in: 'X64' [64]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## X64 = col_logical()
## )
## See spec(...) for full column specifications.
Look like we have an empty column on 2018
and X64
. Don’t worry, we will not use them anyway.
Since the article only use the population data from 2017, we will clean the data first.
data_tfr <- data_tfr %>%
select(`Country Name`,'2017') %>%
rename(tfr = "2017")
data_pop <- data_pop %>%
select(`Country Name`,'2017') %>%
rename(pop = "2017")
data_gdp <- data_gdp %>%
select(`Country Name`,'2017') %>%
rename(gdp = "2017")
head(data_tfr,10)
Next, we will combine the data
df <- data_tfr %>%
left_join(data_pop) %>%
left_join(data_gdp) %>%
na.omit() %>%
rename(country = `Country Name`) %>%
arrange(desc(pop))
## Joining, by = "Country Name"
## Joining, by = "Country Name"
head(df,40)
Look like we have some region/cluster of countries in the data, so we need to remove them.
df <- df[-c(1:15,18:43),]
head(df,10)
Now we are ready to make the plot.
First, we plot the data into scatter plot, with the size of the dot indicate the population size.
p <- df %>%
ggplot()+
geom_point(aes(gdp,tfr,size=pop),color="#2fc1d3",alpha=.5,show.legend = F)
p
We will smooth the data using the log10 scale
p <- p+
scale_x_continuous(trans = "log10",
expand = c(0.05,0),
labels = number_format(big.mark = ","),limits = c(100,110000))
p
Now, we rescale the size of the point.
p <- p+
scale_size_continuous(range = c(3,16),guide = F)
p
Next, we scale the y axis
p <- p+
scale_y_continuous(breaks = seq(0,8,2),limits = c(0,8.3),position = "right",expand = c(0,0))
p
We clean the background and the grid.
p <- p+
theme_economist()+
theme(panel.grid.major.y = element_line(colour = "gray80"),
panel.grid.minor.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.background = element_blank(),
plot.background = element_blank())
p
Now we adjust the tick mark and the position of the y-axis text
p <- p+
theme(axis.ticks.length.x = unit(2,"mm"),
axis.text.y = element_text(vjust = 0))
p
We will remove the y-axis title and rename the x-axis title. Also, we will add the title and caption.
p <- p+
theme(axis.title.y = element_blank(),
axis.title.x = element_text(colour = "black",size = 12),
plot.title = element_text(size = 14,face = "bold",color="black"),
plot.subtitle = element_text(size=12,color="black"),
plot.caption = element_text(size=10,color="gray30",hjust = 0))+
labs(title = "Gone baby gone",
subtitle = "GDP and fertility, 2017 Fertility rate",
caption = "Source: World Bank",
x = "GDP per capita, $, log scale")
p
We will add the annotation of population size and the replacement fertility level
p <- p+
geom_hline(aes(yintercept=2.1),color="#edb0ad",linetype="dashed",size=1)+
geom_text(aes(100,2.4,label = "Replacement fertility level"),
color="#e07b78",hjust="left",size=4.5)+
geom_text(aes(100,.5,label = "Circle size = Population, 2017"),
color="#8e9093",hjust="left",size=4)
p
Now we will highlights and add annotation for selected countries. Since the annotation line is in elbow shape, we need to be creative. To my best knowledge, there is no packages that can directly make an elbow annotation. I’ll show you how to make one.
First, we add the Niger.
df_niger <- df %>% filter(country == "Niger")
p <- p+
geom_text_repel(aes(gdp,tfr,label=country),data = df_niger,nudge_x = -.15,direction = "x")+
geom_point(aes(gdp,tfr,size=pop),data = df_niger,shape=21,fill="#2fc1d3",color="black")
p
Second, we add the India and China
df_inch <- df %>% filter(country %in% c("India","China"))
p <- p+
geom_segment(aes(x=gdp,xend=gdp,y=tfr,yend=tfr-.7),data=df_inch)+
geom_text_repel(aes(gdp,tfr-.7,label=country),data=df_inch,nudge_x = -.3,direction = "x")+
geom_point(aes(gdp,tfr,size=pop),data=df_inch,shape=21,fill="#2fc1d3",color="black")
p
Next, we add Japan and The United States
df_japus <- df %>% filter(country %in% c("Japan","United States"))
p <- p+
geom_text_repel(aes(gdp,tfr,label=country),data=df_japus,nudge_y = 1.5,direction = "y")+
geom_point(aes(gdp,tfr,size=pop),data=df_japus,shape=21,fill="#2fc1d3",color="black")
p
We add Hongkong
df_hongkong <- df %>% filter(country == "Hong Kong SAR, China")
df_hongkong$country <- "Hong kong"
p <- p+
geom_text_repel(aes(gdp,tfr,label=country),data=df_hongkong,nudge_y = -.7,direction = "y")+
geom_point(aes(gdp,tfr,size=pop),data=df_hongkong,shape=21,fill="#2fc1d3",color="black")
p
We add Singapore
df_sing <- df %>% filter(country == "Singapore")
p <- p+
geom_segment(aes(x=gdp,xend=gdp,y=tfr,yend=tfr-.5),data=df_sing,color="black")+
geom_text_repel(aes(gdp,tfr-.5,label=country),data=df_sing,color="black",nudge_x = 0.2,direction = "x")+
geom_point(aes(gdp,tfr,size=pop),data=df_sing,shape=21,fill="#2fc1d3",color="black")
p
Finally, we add South Korea
df_kor <- df %>% filter(country == "Korea, Rep.")
df_kor$country <- "South Korea"
p <- p+
geom_segment(aes(x=gdp,xend=gdp,y=tfr,yend=tfr-.5),data=df_kor)+
geom_text_repel(aes(gdp,tfr-.5,label=country),data=df_kor,nudge_x = -.3,direction = "x",fontface="bold")+
geom_point(aes(gdp,tfr,size=pop),data=df_kor,shape=21,fill="#2fc1d3",color="black")
p
To put cerry on the top, we will add the unique red header of The Economist plot.
png("economist_plot.png", width = 9, height = 6, units = "in", res = 300)
p
#Add red rectangle on top left side
grid.rect(x=1, y=.995,hjust = 1,vjust=0,gp=gpar(fill='#e5001c',lwd=0))
grid.rect(x=0.05, y=.98,hjust = 1,vjust=0,gp=gpar(fill='#e5001c',lwd=0))
dev.off()
## png
## 2
Our plot is ready.