Libraries

library(ggplot2)
library(dplyr)
library(sjPlot)
library(wesanderson)
library(kableExtra)
library(ggsankey)

Dataset

The data come from the Kaggle dataset Life expectancy & Socio-Economic (world bank). The researchers who published it on Kaggle originally used it to analyze how socio-economic and environmental factors might affect life expectancy in sub-Saharan countries. With it, they set out to answer the following four research questions:

  1. What is the impact of expenditure on health and education (% of GDP) on life expectancy?
  2. How does the prevalence of undernourishment and communicable disease affect life expectancy?
  3. Do factors like corruption and unemployment rate impact life expectancy? If yes, quantify the effect.
  4. Does an increase in CO2 emissions decrease life expectancy? Is the effect significant?

More about the dataset can be found here.

file <- read.csv("/Users/PaulMUTAMBA/Documents/Old Pc/HSE Social Info/4th Year/Advance Data Analysis/Data/life expectancy.csv")

DataFrame Creation

# Keep only the columns used below and the two years being compared
df <- file %>%
  select(Country.Name, Region, IncomeGroup, Year,
         Life_expectancy = Life.Expectancy.World.Bank,
         Unemployment, CO2) %>%
  filter(Year %in% c(2009, 2018))

# Mean life expectancy per region in 2018 (missing values ignored)
df_le_region <- df %>%
  filter(Year == 2018) %>%
  group_by(Region) %>%
  summarise(Mean_Life_Expectancy = mean(Life_expectancy, na.rm = TRUE))

Average Life Expectancy per Region in 2018

ggplot(df_le_region, aes(x = Region, y = Mean_Life_Expectancy, fill = Region)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_brewer(palette = "Dark2") +
  labs(fill = "Region", x = "Region", y = "Average Life Expectancy",
       caption = "Data source: Life expectancy & Socio-Economic (world bank)") +
  theme(text = element_text(size = 8),
        axis.text.x = element_text(angle = 90, hjust = 1),
        plot.caption = element_text(hjust = 0.5))

Distribution of the Unemployment Rate in Sub-Saharan Africa in 2018

ggplot(data = df %>% filter(Year == 2018, Region == "Sub-Saharan Africa"),
       aes(x = Unemployment)) +
  geom_histogram(binwidth = 3, color = "black", fill = "#FFB273") +
  # Dashed line at the regional mean; mapping color to a constant creates the legend entry
  geom_vline(aes(xintercept = mean(Unemployment, na.rm = TRUE), color = "mean"),
             linetype = "dashed", size = 1) +
  scale_color_manual(name = "Measurement", values = c(mean = "#824acd")) +
  labs(x = "Unemployment",
       caption = "Unemployment refers to the % share of the labor force that is without work\nbut available for and seeking employment") +
  # theme_bw() goes first: a complete theme resets any theme() settings added before it
  theme_bw() +
  theme(plot.caption = element_text(hjust = 0.25))

Life Expectancy vs CO2 Emissions in Sub-Saharan Africa and Middle East & North Africa

ggplot(df %>% filter(Year == 2018,
                     Region %in% c("Sub-Saharan Africa", "Middle East & North Africa")),
       aes(x = log10(CO2), y = Life_expectancy)) +
  geom_point() +
  # Linear fit of life expectancy on log10(CO2), drawn without a confidence band
  stat_smooth(method = "lm", col = "#C42126", se = FALSE, size = 1.5) +
  labs(x = "CO2 emissions - log 10", y = "Life Expectancy",
       caption = "Data source: Life expectancy & Socio-Economic (world bank)") +
  theme_bw()
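
The red line in the plot is an ordinary least-squares fit, so the same model can be fitted directly to read off its slope and p-value. The following is a minimal sketch rather than part of the original analysis: fit_le_co2() is a hypothetical helper name, and the small toy data frame holds made-up values that only demonstrate the call.

```r
# Hypothetical helper: fit Life_expectancy ~ log10(CO2) on the rows where
# both the log-transformed CO2 value and life expectancy are usable.
fit_le_co2 <- function(data) {
  log_co2 <- suppressWarnings(log10(data$CO2))
  keep <- is.finite(log_co2) & !is.na(data$Life_expectancy)
  lm(Life_expectancy ~ log10(CO2), data = data[keep, , drop = FALSE])
}

# Made-up values, for illustration only (not taken from the dataset).
# The NA and the zero are there to show that such rows are dropped.
toy <- data.frame(
  CO2 = c(10, 100, 1000, 10000, NA, 0),
  Life_expectancy = c(55, 60.5, 64.5, 70, 62, 58)
)

fit <- fit_le_co2(toy)
coef(fit)                  # intercept, and slope per tenfold increase in CO2
summary(fit)$coefficients  # standard errors, t values and p-values
```

On the real data the helper would be called on the same filtered data frame that is passed to ggplot() above. The slope is then the estimated change in life expectancy for a tenfold increase in CO2.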

CO2 Emissions per Region

ggplot(data = df %>% filter(Region %in% c("Sub-Saharan Africa", "Middle East & North Africa", "South Asia")), aes(x= Region, y= log10(CO2), fill= as.factor(Year))) +
  scale_fill_brewer(palette="Dark2") +
  geom_boxplot(alpha=0.5, notch = TRUE) + 
  xlab("Region") +
  ylab("CO2 Emission - Log10") +
  labs(caption = "Data source: Life expectancy & Socio-Economic (world bank)", fill = "Year") +
  theme(plot.caption = element_text(hjust = 0.5)) 

Average CO2 Consumption by Income Group and Region in 2018

sliced_df <- df %>%
  filter(Year == 2018) %>%
  group_by(Region, IncomeGroup, Year) %>%
  summarise(avg_CO2_Consumption = mean(CO2, na.rm = TRUE))
kable(sliced_df) %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

Region                      IncomeGroup          Year  avg_CO2_Consumption
East Asia & Pacific         High income          2018           316608.007
East Asia & Pacific         Lower middle income  2018            98067.273
East Asia & Pacific         Upper middle income  2018          1375827.460
Europe & Central Asia       High income          2018           111568.573
Europe & Central Asia       Lower middle income  2018           102256.666
Europe & Central Asia       Upper middle income  2018            40745.386
Latin America & Caribbean   High income          2018            20456.667
Latin America & Caribbean   Lower middle income  2018             9698.000
Latin America & Caribbean   Upper middle income  2018            78734.705
Middle East & North Africa  High income          2018           130653.750
Middle East & North Africa  Lower middle income  2018            57715.999
Middle East & North Africa  Upper middle income  2018            81516.665
North America               High income          2018          2777700.043
South Asia                  Low income           2018             6070.000
South Asia                  Lower middle income  2018           460103.321
South Asia                  Upper middle income  2018             2100.000
Sub-Saharan Africa          High income          2018              580.000
Sub-Saharan Africa          Low income           2018             4149.545
Sub-Saharan Africa          Lower middle income  2018            15836.667
Sub-Saharan Africa          Upper middle income  2018            76928.334

ggsankey Plot

I found ggsankey on Twitter through the hashtag #tidytuesday. A few years ago I used to follow #tidytuesday on Twitter (now “X”) for plotting inspiration, so for this assignment I looked through recent tweets with the hashtag to see whether there were any new techniques I could apply. One of the tweets used ggsankey, a package and technique that were completely new to me, to analyse the flow of immigration from Ukraine, Afghanistan and Syria to Germany, Iran and Turkey. Because the package was new to me, I decided to apply it to the current homework. Although ggsankey is built on ggplot2, it sets up its aesthetics quite differently: alongside x, it requires three ‘special’ aesthetics, next_x, node and next_node. More information on using ggsankey can be found here.
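
To make the three aesthetics concrete, the sketch below mimics in base R the long format that make_long() returns, as I understand it: one row per observation and stage, where x and node give the current stage and its value, and next_x and next_node give the following stage (NA at the last stage). The helper to_long_sketch() and the two-row example are hypothetical illustrations, not the package's own implementation, and the row order may differ from what the package produces.

```r
# Hypothetical base-R illustration of the long format (not ggsankey's own code)
to_long_sketch <- function(data, cols) {
  pieces <- lapply(seq_along(cols), function(i) {
    is_last <- i == length(cols)
    data.frame(
      x         = cols[i],
      node      = as.character(data[[cols[i]]]),
      # The last stage has no successor, so its "next" fields stay NA
      next_x    = if (is_last) NA_character_ else cols[i + 1],
      next_node = if (is_last) NA_character_ else as.character(data[[cols[i + 1]]]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

# Made-up two-row example
toy <- data.frame(
  Region      = c("Sub-Saharan Africa", "South Asia"),
  IncomeGroup = c("Low income", "Low income"),
  stringsAsFactors = FALSE
)

long <- to_long_sketch(toy, c("Region", "IncomeGroup"))
long
```

Each flow in the diagram connects a node at stage x to a next_node at stage next_x, which is why all four columns have to be mapped in aes().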

I applied the package to my case to visualise the relationship between region, income group and CO2 consumption. The average CO2 consumption is numeric, so I first binned it into labelled categories, as ggsankey only works with character or factor variables.

#summary(sliced_df$avg_CO2_Consumption)

sliced_df$CO2_Consumption <- case_when(
  sliced_df$avg_CO2_Consumption < 14303 ~ "Very Low Consumption",
  sliced_df$avg_CO2_Consumption < 77833 ~ "Low Consumption",
  sliced_df$avg_CO2_Consumption < 288366 ~ "Average Consumption",
  # Open-ended top bin, so the maximum (2777700.043) is not left as NA and dropped by na.omit()
  sliced_df$avg_CO2_Consumption >= 288366 ~ "High Consumption")

sliced_df2 <- sliced_df %>% ungroup()

sliced_df2 <- na.omit(sliced_df2)
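
The cut points in the case_when() call above appear to match the first quartile, the median and the mean of avg_CO2_Consumption, as the commented-out summary() call would report them. As a sketch under that assumption, the labels could be derived from the data instead of being hard-coded. The helper label_co2() is a hypothetical name and the input vector is made up. This ordering of cut points only makes sense when the mean lies above the median, as it does for right-skewed values like these.

```r
# Hypothetical helper: label values by first quartile, median and mean.
# Assumes q1 <= median <= mean (right-skewed data); NA inputs stay NA.
label_co2 <- function(x) {
  q1  <- unname(quantile(x, 0.25, na.rm = TRUE))
  med <- median(x, na.rm = TRUE)
  avg <- mean(x, na.rm = TRUE)
  ifelse(x < q1,  "Very Low Consumption",
  ifelse(x < med, "Low Consumption",
  ifelse(x < avg, "Average Consumption",
                  "High Consumption")))
}

# Made-up values, for illustration only
toy_values <- c(1, 2, 3, 4, 100)
labels <- label_co2(toy_values)
labels
```

In the notebook this would be called as label_co2(sliced_df$avg_CO2_Consumption). Because the top bin is open-ended, the largest value always receives a label.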

Region - Income Group - CO2 Consumption

sliced_df2$CO2_Consumption1 <- factor(sliced_df2$CO2_Consumption, 
                                 levels = c("Very Low Consumption", "Low Consumption", "Average Consumption", "High Consumption"))

#levels(sliced_df2$CO2_Consumption)

sliced_df2long <- sliced_df2 %>% make_long(Region, CO2_Consumption1, IncomeGroup)
dagg <- sliced_df2long %>% dplyr::group_by(node) %>% tally()
df2 <- merge(sliced_df2long, dagg, by.x = 'node', by.y = 'node', all.x = TRUE)

pl <- ggplot(df2, aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = factor(node), label = paste0(node, " n=", n)))

pl <- pl + geom_sankey(flow.alpha = 0.5,
                       node.color = "black",
                       show.legend = TRUE)
pl <- pl + geom_sankey_label(size = 3, color = "black", fill = "white")
pl <- pl + theme_bw()
pl <- pl + theme(legend.position = "none", axis.title = element_blank(), axis.text.y = element_blank(),
                 axis.ticks = element_blank(),
                 panel.grid = element_blank())
pl <- pl + labs(title = "Relationship between Region, Income group, and CO2 consumption", caption = "Plot using ggsankey", fill = 'Nodes')

pl

It has to be said that a Sankey diagram is not the most suitable graph for our purposes. This type of graph works best for visualising flows, such as immigration, sales or cooperation, from one set of nodes (countries, companies, regions, etc.) to another.