Visualization

Libraries

library(ggplot2)
library(dplyr)
library(sjPlot)
library(wesanderson)
library(kableExtra)
library(ggsankey)

Dataset

The dataset is Life expectancy & Socio-Economic (World Bank). The researchers who published it on Kaggle originally used it to analyze the socio-economic and environmental factors that might affect life expectancy in sub-Saharan countries. Using the dataset, they attempted to answer the following four research questions:

What is the impact of expenditure on health and education (% of GDP) on life expectancy?
How does the prevalence of undernourishment and communicable disease affect life expectancy?
Do factors like corruption and unemployment rate impact life expectancy? If yes, by how much?
Does an increase in CO2 emissions decrease life expectancy? Is the effect significant?

More about the dataset can be found here.

file <- read.csv("/Users/PaulMUTAMBA/Documents/Old Pc/HSE Social Info/4th Year/Advance Data Analysis/Data/life expectancy.csv")

DataFrame Creation

df <- file %>% select(Country.Name, Region, IncomeGroup, Year, Life_expectancy = Life.Expectancy.World.Bank, Unemployment, CO2) %>% filter(Year == 2009 | Year == 2018)

df_le_region <- df %>%
  filter(Year == 2018) %>%
  group_by(Region) %>%
  # Regional mean, ignoring missing values
  summarise(Mean_Life_Expectancy = mean(Life_expectancy, na.rm = TRUE))
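Because the summary column is labelled as a mean, it helps to see how far a group's mean and median can drift apart when the values are skewed. The following is a minimal sketch on made-up numbers (the `toy` data frame and its values are hypothetical, not taken from the dataset):

```r
library(dplyr)

# Hypothetical toy data: region "A" has one unusually high value,
# region "B" has a missing value that na.rm = TRUE skips.
toy <- data.frame(
  Region = c("A", "A", "A", "B", "B"),
  Life_expectancy = c(60, 62, 80, 70, NA)
)

toy_summary <- toy %>%
  group_by(Region) %>%
  summarise(
    mean_le   = mean(Life_expectancy, na.rm = TRUE),
    median_le = median(Life_expectancy, na.rm = TRUE)
  )

print(toy_summary)
```

In region "A" the single high value pulls the mean above the median, so the choice of statistic should match the label shown on the plot.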

Average Life Expectancy per Region in 2018

ggplot(df_le_region, aes(x = Region, y = Mean_Life_Expectancy, fill = Region)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_brewer(palette = "Dark2") +
  labs(fill = "Region", x = "Region", y = "Average Life Expectancy",
       caption = "Data source: Life expectancy & Socio-Economic (world bank)") +
  theme(text = element_text(size = 8),
        axis.text.x = element_text(angle = 90, hjust = 1),
        plot.caption = element_text(hjust = 0.5))

Distribution of the Unemployment Rate in Sub-Saharan Africa in 2018

ggplot(data = df %>% filter(Year == 2018, Region == "Sub-Saharan Africa"),
       aes(x = Unemployment)) +
  geom_histogram(binwidth = 3, color = "black", fill = "#FFB273") +
  # Dashed line at the mean; mapping color to a constant creates the legend entry
  geom_vline(aes(xintercept = mean(Unemployment, na.rm = TRUE), color = "mean"),
             linetype = "dashed", size = 1) +
  scale_color_manual(name = "Measurement", values = c(mean = "#824acd")) +
  labs(x = "Unemployment",
       caption = "Unemployment refers to the % share of the labor force that is without work \n but available for and seeking employment") +
  # theme_bw() comes first: as a complete theme it would otherwise reset the theme() call
  theme_bw() +
  theme(plot.caption = element_text(hjust = 0.25))

Life Expectancy vs CO2 Emissions in Sub-Saharan Africa and Middle East & North Africa

ggplot(df %>% filter(Year == 2018, Region %in% c("Sub-Saharan Africa", "Middle East & North Africa")),
       aes(x = log10(CO2), y = Life_expectancy)) +
  geom_point() +
  stat_smooth(method = "lm", col = "#C42126", se = FALSE, size = 1.5) +
  labs(x = "CO2 emissions - log 10", y = "Life Expectancy",
       caption = "Data source: Life expectancy & Socio-Economic (world bank)") +
  theme_bw()
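The line drawn by stat_smooth(method = "lm") corresponds to an ordinary linear regression of life expectancy on log10(CO2). To read off the slope and its p-value, the same model can be fitted directly with lm(). The sketch below uses a hypothetical toy data frame with made-up values, not the World Bank data:

```r
# Hypothetical toy data: life expectancy rises roughly 5 years per
# tenfold increase in CO2, plus a little noise (values are made up).
toy <- data.frame(CO2 = c(10, 100, 1000, 10000, 100000))
toy$Life_expectancy <- 50 + 5 * log10(toy$CO2) + c(0.5, -0.5, 0, 0.5, -0.5)

# Same model that stat_smooth(method = "lm") fits to the plotted points
fit <- lm(Life_expectancy ~ log10(CO2), data = toy)

# Estimates, standard errors, t values and p-values for intercept and slope
print(summary(fit)$coefficients)
```

With the real data, the `toy` data frame would be replaced by the filtered 2018 subset used in the plot, keeping in mind that rows with zero or missing CO2 would need handling before taking the logarithm.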

CO2 Emissions per Region

ggplot(data = df %>% filter(Region %in% c("Sub-Saharan Africa", "Middle East & North Africa", "South Asia")), aes(x= Region, y= log10(CO2), fill= as.factor(Year))) +
  scale_fill_brewer(palette="Dark2") +
  geom_boxplot(alpha=0.5, notch = TRUE) + 
  xlab("Region") +
  ylab("CO2 Emission - Log10") +
  labs(caption = "Data source: Life expectancy & Socio-Economic (world bank)", fill = "Year") +
  theme(plot.caption = element_text(hjust = 0.5))

Average CO2 Consumption by Income Group and Region in 2018

sliced_df <- df %>%
  filter(Year == 2018) %>%
  group_by(Region, IncomeGroup, Year) %>%
  summarise(avg_CO2_Consumption = mean(CO2, na.rm = TRUE))
kable(sliced_df) %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

Region	IncomeGroup	Year	avg_CO2_Consumption
East Asia & Pacific	High income	2018	316608.007
East Asia & Pacific	Lower middle income	2018	98067.273
East Asia & Pacific	Upper middle income	2018	1375827.460
Europe & Central Asia	High income	2018	111568.573
Europe & Central Asia	Lower middle income	2018	102256.666
Europe & Central Asia	Upper middle income	2018	40745.386
Latin America & Caribbean	High income	2018	20456.667
Latin America & Caribbean	Lower middle income	2018	9698.000
Latin America & Caribbean	Upper middle income	2018	78734.705
Middle East & North Africa	High income	2018	130653.750
Middle East & North Africa	Lower middle income	2018	57715.999
Middle East & North Africa	Upper middle income	2018	81516.665
North America	High income	2018	2777700.043
South Asia	Low income	2018	6070.000
South Asia	Lower middle income	2018	460103.321
South Asia	Upper middle income	2018	2100.000
Sub-Saharan Africa	High income	2018	580.000
Sub-Saharan Africa	Low income	2018	4149.545
Sub-Saharan Africa	Lower middle income	2018	15836.667
Sub-Saharan Africa	Upper middle income	2018	76928.334

GGSankey Plot

I found ggsankey on Twitter through the hashtag #tidytuesday. A few years ago I used to follow #tidytuesday on Twitter (now “X”) for plotting inspiration, so for this assignment I looked at recent tweets with the hashtag to see whether there were any new techniques I could apply. One of the tweets used ggsankey, a package and technique that were completely new to me, to analyse the flow of immigration from Ukraine, Afghanistan and Syria to Germany, Iran and Turkey, so I decided to apply it to the current homework. Although ggsankey is built on the ggplot2 package, it sets up the aesthetics quite differently. Alongside x, it requires three ‘special’ arguments in the aesthetic: next_x, node and next_node. More information on using ggsankey can be found here.
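To make the special aesthetics more concrete, here is a hand-rolled sketch of the long layout they expect: one row per observation and stage, holding the current stage (x), its value (node), and the stage and value that follow (next_x, next_node). It uses base R only and made-up toy values. It is an illustration of the data shape, not ggsankey's own implementation, and in practice the package's make_long() does this step.

```r
# Hypothetical toy data: two observations, three stages (values are made up)
toy <- data.frame(
  Region      = c("South Asia", "North America"),
  Consumption = c("Low", "High"),
  IncomeGroup = c("Low income", "High income"),
  stringsAsFactors = FALSE
)

stages <- names(toy)

# One block of rows per stage: the current stage/value plus the stage/value
# that follows it. The last stage has no successor, hence NA.
long <- do.call(rbind, lapply(seq_along(stages), function(i) {
  has_next <- i < length(stages)
  data.frame(
    x         = stages[i],
    node      = toy[[stages[i]]],
    next_x    = if (has_next) stages[i + 1] else NA_character_,
    next_node = if (has_next) toy[[stages[i + 1]]] else NA_character_,
    stringsAsFactors = FALSE
  )
}))

print(long)
```

Each flow in the diagram is then a link from (x, node) to (next_x, next_node), which is why those four columns are mapped in aes().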

I applied the package to my case to visualise the relationship between region, income group and CO2 consumption. Because ggsankey only works with character or factor variables, I first binned CO2 consumption into categories.

#summary(sliced_df$avg_CO2_Consumption)

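# Bin average CO2 consumption into four ordered categories
sliced_df$CO2_Consumption <- case_when(
  sliced_df$avg_CO2_Consumption < 14303 ~ "Very Low Consumption",
  sliced_df$avg_CO2_Consumption < 77833 ~ "Low Consumption",
  sliced_df$avg_CO2_Consumption < 288366 ~ "Average Consumption",
  # Open-ended top bin: the largest average (2777700.043) exceeds 2777700 and would otherwise become NA
  sliced_df$avg_CO2_Consumption >= 288366 ~ "High Consumption")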

sliced_df2 <- sliced_df %>% ungroup()

sliced_df2 <- na.omit(sliced_df2)
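As an alternative to chained conditions, the same binning could be sketched with base R's cut(), which returns an ordered set of factor levels in one step. The break points below are the ones used in this section. The input vector is illustrative: four averages copied from the table above plus the boundary value 14303 to show which side of a break it falls on.

```r
# Illustrative inputs: four averages from the table plus one boundary value
avg_co2 <- c(580, 14303, 78734.705, 130653.75, 2777700.043)

bin_labels <- c("Very Low Consumption", "Low Consumption",
                "Average Consumption", "High Consumption")

# right = FALSE gives intervals [lower, upper), matching the "<" comparisons;
# -Inf and Inf keep the lowest and highest values inside a bin.
binned <- cut(avg_co2,
              breaks = c(-Inf, 14303, 77833, 288366, Inf),
              labels = bin_labels,
              right = FALSE)

print(data.frame(avg_co2, binned))
```

Using Inf as the upper break means no observed maximum can fall outside the last category, and the resulting factor already carries the level order needed for plotting.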

Region - CO2 Consumption - Income Group

sliced_df2$CO2_Consumption1 <- factor(sliced_df2$CO2_Consumption, 
                                 levels = c("Very Low Consumption", "Low Consumption", "Average Consumption", "High Consumption"))

#levels(sliced_df2$CO2_Consumption)

sliced_df2long <- sliced_df2 %>% make_long(Region, CO2_Consumption1, IncomeGroup)
dagg <- sliced_df2long %>% dplyr::group_by(node) %>% tally()
df2 <- merge(sliced_df2long, dagg, by.x = 'node', by.y = 'node', all.x = TRUE)

pl <- ggplot(df2, aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = factor(node), label = paste0(node, " n=", n)))

pl <- pl + geom_sankey(flow.alpha = 0.5,
                       node.color = "black",
                       show.legend = TRUE)
pl <- pl + geom_sankey_label(size = 3, color = "black", fill = "white")
pl <- pl + theme_bw()
pl <- pl + theme(legend.position = "none", axis.title = element_blank(), axis.text.y = element_blank(),
                 axis.ticks = element_blank(),
                 panel.grid = element_blank())
pl <- pl + labs(title = "Relationship between Region, Income group, and CO2 consumption", caption = "Plot made with ggsankey", fill = 'Nodes')

pl

It has to be said that a Sankey diagram is not the most suitable graph for our purposes. This type of graph works better for visualising flows, such as immigration, sales or cooperation, from one set of nodes (countries, companies, regions, etc.) to another.