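Libraries
# Packages inferred from the functions called in this report
library(dplyr)       # %>%, select(), filter(), group_by(), summarise(), case_when()
library(ggplot2)     # ggplot() and its geoms, scales and themes
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(ggsankey)    # make_long(), geom_sankey(), geom_sankey_label()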
Dataset
Life expectancy & Socio-Economic (world bank). The researchers who published the dataset on Kaggle originally used it to analyse how socio-economic and environmental factors might affect life expectancy in sub-Saharan countries. Using the dataset, they attempted to answer the following four research questions:
- What’s the Impact of Expenditure on Health and Education (% of GDP) on Life Expectancy?
- How does the prevalence of undernourishment and communicable disease affect life expectancy?
- Do factors like corruption and the unemployment rate impact life expectancy? If so, by how much?
- Does an increase in CO2 emissions decrease life expectancy? Is the effect significant?
More about the dataset can be found here.
file <- read.csv("/Users/PaulMUTAMBA/Documents/Old Pc/HSE Social Info/4th Year/Advance Data Analysis/Data/life expectancy.csv")
DataFrame Creation
# Keep only the variables used below, for the two years being compared
df <- file %>%
  select(Country.Name, Region, IncomeGroup, Year,
         Life_expectancy = Life.Expectancy.World.Bank, Unemployment, CO2) %>%
  filter(Year %in% c(2009, 2018))
# Mean life expectancy per region in 2018
df_le_region <- df %>%
  filter(Year == 2018) %>%
  group_by(Region) %>%
  summarise(Mean_Life_Expectancy = mean(Life_expectancy, na.rm = TRUE))
Average Life Expectancy per Region in 2018
ggplot(df_le_region, aes(x = Region, y = Mean_Life_Expectancy, fill = Region)) +
  geom_bar(stat = "identity") +
  theme(text = element_text(size = 8), axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(fill = "Region", x = "Region", y = "Average Life Expectancy", caption = "Data source: Life expectancy & Socio-Economic (world bank)") +
  coord_flip() +
  theme(plot.caption = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Dark2")
Distribution of the Unemployment Rate in 2018 in Sub-Saharan Africa
ggplot(data = df %>% filter(Year == 2018, Region == "Sub-Saharan Africa"), aes(x = Unemployment)) +
  geom_histogram(binwidth = 3, color = "black", fill = "#FFB273") +
  # Dashed vertical line at the regional mean, mapped to colour so it gets a legend entry
  geom_vline(aes(xintercept = mean(Unemployment, na.rm = TRUE), color = "mean"), linetype = "dashed", size = 1) +
  labs(x = "Unemployment", caption = "Unemployment refers to the % share of the labor force that is without work \n but available for and seeking employment") +
  # theme_bw() comes first: a complete theme added later would overwrite the theme() tweak
  theme_bw() +
  theme(plot.caption = element_text(hjust = 0.25)) +
  scale_color_manual(name = "Measurement", values = c(mean = "#824acd"))
Life Expectancy vs CO2 Consumption of Sub-Saharan Africa and Middle East & North Africa
ggplot(df %>% filter(Year == 2018, Region %in% c("Sub-Saharan Africa", "Middle East & North Africa")), aes(x = log10(CO2), y = Life_expectancy)) +
  geom_point() +
  # Linear fit without a confidence band
  stat_smooth(method = "lm", col = "#C42126", se = FALSE, size = 1.5) +
  labs(x = "CO2 emissions - log 10", y = "Life Expectancy", caption = "Data source: Life expectancy & Socio-Economic (world bank)") +
  theme_bw()
CO2 Emissions per Region
ggplot(data = df %>% filter(Region %in% c("Sub-Saharan Africa", "Middle East & North Africa", "South Asia")), aes(x = Region, y = log10(CO2), fill = as.factor(Year))) +
  scale_fill_brewer(palette = "Dark2") +
  geom_boxplot(alpha = 0.5, notch = TRUE) +
  xlab("Region") +
  ylab("CO2 Emission - Log10") +
  labs(caption = "Data source: Life expectancy & Socio-Economic (world bank)", fill = "Year") +
  theme(plot.caption = element_text(hjust = 0.5))
Average CO2 Consumption of Different Income Groups per Region in 2018
sliced_df <- df %>%
  filter(Year == 2018) %>%
  group_by(Region, IncomeGroup, Year) %>%
  summarise(avg_CO2_Consumption = mean(CO2, na.rm = TRUE))
kable(sliced_df) %>%
  kable_styling(bootstrap_options = c("bordered", "responsive", "striped"), full_width = FALSE)

| Region | IncomeGroup | Year | avg_CO2_Consumption |
|---|---|---|---|
| East Asia & Pacific | High income | 2018 | 316608.007 |
| East Asia & Pacific | Lower middle income | 2018 | 98067.273 |
| East Asia & Pacific | Upper middle income | 2018 | 1375827.460 |
| Europe & Central Asia | High income | 2018 | 111568.573 |
| Europe & Central Asia | Lower middle income | 2018 | 102256.666 |
| Europe & Central Asia | Upper middle income | 2018 | 40745.386 |
| Latin America & Caribbean | High income | 2018 | 20456.667 |
| Latin America & Caribbean | Lower middle income | 2018 | 9698.000 |
| Latin America & Caribbean | Upper middle income | 2018 | 78734.705 |
| Middle East & North Africa | High income | 2018 | 130653.750 |
| Middle East & North Africa | Lower middle income | 2018 | 57715.999 |
| Middle East & North Africa | Upper middle income | 2018 | 81516.665 |
| North America | High income | 2018 | 2777700.043 |
| South Asia | Low income | 2018 | 6070.000 |
| South Asia | Lower middle income | 2018 | 460103.321 |
| South Asia | Upper middle income | 2018 | 2100.000 |
| Sub-Saharan Africa | High income | 2018 | 580.000 |
| Sub-Saharan Africa | Low income | 2018 | 4149.545 |
| Sub-Saharan Africa | Lower middle income | 2018 | 15836.667 |
| Sub-Saharan Africa | Upper middle income | 2018 | 76928.334 |
GGSankey Plot
I found GGSankey on Twitter through the hashtag #tidytuesday. A few years ago I used to follow #tidytuesday on Twitter (now “X”) for plotting inspiration. For this assignment, I looked at recent tweets that used the hashtag to see whether there were any new techniques I could apply. One of the tweets used GGSankey, a package and technique that were completely new to me, to analyse the flow of migration from Ukraine, Afghanistan and Syria to Germany, Iran and Turkey. Because the package was new to me, I decided to apply it to the current homework. Although GGSankey is based on the ggplot2 package, it sets up the aesthetics quite differently: it requires three ‘special’ arguments in the aesthetic mapping, namely next_x, node and next_node. More information on using GGSankey can be found here.
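To make the next_x, node and next_node arguments more concrete, here is a small base R sketch of the long layout those aesthetics expect. It is a hand-rolled illustration of the shape that, as far as I understand, make_long() produces; it is not GGSankey's own implementation, and the toy table `wide` and its values are made up for the example.

```r
# Toy wide table: one row per observation, one column per stage (made-up values)
wide <- data.frame(
  Region      = c("A", "A", "B"),
  IncomeGroup = c("High", "Low", "Low"),
  stringsAsFactors = FALSE
)

stages <- names(wide)

# For every stage, pair each node with the node in the following stage.
# The last stage has no successor, so next_x and next_node are NA there.
long <- do.call(rbind, lapply(seq_along(stages), function(i) {
  has_next <- i < length(stages)
  data.frame(
    x         = stages[i],
    node      = wide[[stages[i]]],
    next_x    = if (has_next) stages[i + 1] else NA_character_,
    next_node = if (has_next) wide[[stages[i + 1]]] else NA_character_,
    stringsAsFactors = FALSE
  )
}))

long
```

Each row describes one link: the stage and node it starts from (x, node) and the stage and node it flows into (next_x, next_node). Rows from the last stage carry NA in the two "next" columns because nothing follows them.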
I applied the package to my case to visualise the relationship between regions, income groups and CO2 consumption. I first binned CO2 consumption into labelled categories, as GGSankey only works with character or factor variables.
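As a side note on the binning step, the same kind of labelling can be sketched with cut() instead of hard-coded thresholds. This is only an illustration under assumptions: the values in `co2` are toy numbers, not the dataset, and the choice of first quartile, median and mean as break points is my guess at how cut points could be read off a summary() call.

```r
# Toy values standing in for avg_CO2_Consumption (illustrative, not from the dataset)
co2 <- c(500, 4000, 9000, 20000, 60000, 80000, 120000, 300000, 1400000, 2800000)

# Break points derived from the data; -Inf and Inf keep every value inside a bin
breaks <- c(-Inf, quantile(co2, 0.25), median(co2), mean(co2), Inf)

labels <- c("Very Low Consumption", "Low Consumption",
            "Average Consumption", "High Consumption")

# right = FALSE makes each bin [lower, upper), i.e. "value < upper bound"
co2_level <- cut(co2, breaks = breaks, labels = labels, right = FALSE)

table(co2_level)
```

Because the outer breaks are infinite, no observation is left unlabelled, and cut() returns a factor whose levels are already in the intended order.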
#summary(sliced_df$avg_CO2_Consumption)
# Label each Region/IncomeGroup average with a consumption level
sliced_df$CO2_Consumption <- case_when(
  sliced_df$avg_CO2_Consumption < 14303 ~ "Very Low Consumption",
  sliced_df$avg_CO2_Consumption < 77833 ~ "Low Consumption",
  sliced_df$avg_CO2_Consumption < 288366 ~ "Average Consumption",
  # Open-ended top bin, so the maximum (2777700.043) is not dropped by a rounded upper bound
  sliced_df$avg_CO2_Consumption >= 288366 ~ "High Consumption")
sliced_df2 <- sliced_df %>% ungroup()
sliced_df2 <- na.omit(sliced_df2)
Region - Income Group - CO2 Consumption
# Ordered factor so the consumption nodes appear from lowest to highest
sliced_df2$CO2_Consumption1 <- factor(sliced_df2$CO2_Consumption,
                                      levels = c("Very Low Consumption", "Low Consumption", "Average Consumption", "High Consumption"))
#levels(sliced_df2$CO2_Consumption1)
sliced_df2long <- sliced_df2 %>% make_long(Region, CO2_Consumption1, IncomeGroup)
dagg <- sliced_df2long %>% dplyr::group_by(node) %>% tally()
df2 <- merge(sliced_df2long, dagg, by.x = 'node', by.y = 'node', all.x = TRUE)
pl <- ggplot(df2, aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = factor(node), label = paste0(node, " n=", n)))
pl <- pl + geom_sankey(flow.alpha = 0.5,
                       node.color = "black",
                       show.legend = TRUE)
pl <- pl + geom_sankey_label(size = 3, color = "black", fill = "white")
pl <- pl + theme_bw()
pl <- pl + theme(legend.position = "none", axis.title = element_blank(), axis.text.y = element_blank(),
                 axis.ticks = element_blank(),
                 panel.grid = element_blank())
pl <- pl + labs(title = "Relationship between Region, Income group, and CO2 consumption", caption = "Plot using ggsankey", fill = 'Nodes')
pl
Admittedly, a Sankey diagram is not the most suitable graph for our purposes. This type of graph works better when we want to visualise flows, such as migration, sales or cooperation, from one set of nodes (countries, companies, regions, etc.) to another set of nodes.