Three years ago, I read a book by Hans Rosling that really changed my perspective on relying less on feelings and more on data. The book is “Factfulness”. I won’t go into the details, but its major theme is that the world is getting better by most measures, and that the data makes this hard to dispute. You can check out this short video by Hans Rosling: https://www.youtube.com/watch?v=jbkSRLYSojo
Then last year I picked up Steven Pinker’s The Better Angels of Our Nature (Bill Gates had recommended it and even called it “one of the most important books I’ve read - not just this year but ever”). It has the same theme as Factfulness, but it runs to over 800 pages, and Pinker goes back some 2,000 years to track every area of violence (war, rape, genocide, animal rights, human rights). For each of these there has been a steady decline, and the world today is better and less violent in almost every area you can think of.
Yet people don’t feel that way. Take Nigeria as an example. If you ran a poll today asking Nigerians to compare the country now with 20-30 years ago, a lot of them (I would guess around 90%) would tell you that Nigeria was better back then (outright bullshit, by the way). Ask those same people whether they would rather have been born 20-30 years ago than today, and you’ll get far more “No” answers than “Yes”. If the country really was better 20-30 years ago, you would expect them to prefer being born then.
It’s not hard to understand why people feel they have it worse than the previous generation. If you spend a lot of your time watching CNN or reading negative stories on social media, you will most likely think the world is not getting any better. Unlike news companies, which I think are in it for the long haul (and whose influence is waning compared with the old days), I do believe that social media companies like Facebook, Twitter and Instagram have the power and influence to amplify positive content on their platforms and to encourage their users to share positive stories or events from their lives. After all, social media is where we spend most of our (free) time these days.
The first time I read about the gapminder data, I didn’t pay much attention to it. Then, while reading R for Data Science, I came across it again, and this time I decided to put my R skills to use and do some basic analysis (nothing out of the ordinary). Gapminder is data compiled by Hans Rosling’s Gapminder foundation (remember him from Act I?).
The gapminder data records population (pop), life expectancy (lifeExp) and GDP per capita (gdpPercap) for 142 countries from 1952 to 2007, at five-year intervals. We can get this data in R by loading the gapminder package.
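As a quick sanity check on the shape of the data described above, here is a minimal base R sketch. It only uses the figures stated in the text (142 countries, 1952 to 2007, five-year steps); the variable names are my own and nothing is read from the gapminder package itself.

```r
# Survey years implied by the text: 1952 to 2007 in five-year steps
years <- seq(1952, 2007, by = 5)

n_years <- length(years)          # observations per country
n_countries <- 142                # country count stated in the text
n_rows <- n_countries * n_years   # expected number of country-year rows

print(years)
print(n_rows)
```

If the loaded data frame has a different number of rows, some countries are missing years, which would be worth knowing before plotting.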
A snapshot of the gapminder data can be seen below:
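library(tidyverse)   # dplyr, ggplot2, stringr, tidyr and the %>% pipe
library(gapminder)   # provides the gapminder data frame
gapminder %>%
  DT::datatable()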
Let’s start with a deep dive into the data. A good way to understand data is to plot it. Rather than plotting every country in a single panel (which won’t be readable with 142 countries), we can plot by continent. This can be done with facet_wrap(), a ggplot2 feature.
gapminder %>%
  ggplot(aes(year, lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~continent) +
  labs(
    x = "Year",
    y = "Life Expectancy",
    title = "Life Expectancy on a positive trend across the globe",
    subtitle = "Each line on the plot represents a country. We are better off now than the previous generation, and they are better off \n than the generation before them. We also have some obvious outliers in Africa and Asia (lines with the V shape). \n For Africa, I'm guessing that's Rwanda [genocide effect] - and for Asia, guessing it's Vietnam [vietnam war].",
    caption = "gapminder data"
  ) +
  theme(
    plot.title = element_text(hjust = .5, lineheight = .20),
    plot.subtitle = element_text(hjust = .5),
    plot.caption = element_text(size = 15)
  )
What is obvious from the plot is the roughly linear relationship between life expectancy and year. We do have some outliers in Africa and Asia: life expectancy drops sharply to almost 20 years for one country in Africa and to about 30 years for another in Asia. Countries in Europe, Oceania and the Americas don’t have an obvious outlier.
We can start making some guesses about why life expectancy in Oceania, Europe and (to some degree) the Americas seems to follow an almost perfectly linear trend, whereas Africa and Asia have wiggly trend lines. One place to start is leadership: the type of government or political system prevalent in a region (democratic, monarchical, authoritarian or totalitarian). In the latter half of the 20th century, autocratic and totalitarian governments were more prevalent in Africa and Asia (our assumption is starting to make sense now). The Americas (down south) also had a thing for autocratic governments in that period, which kind of explains why their trend is not as smooth as Europe’s or Oceania’s, but compared to Asia and Africa their autocracy doesn’t stand a chance (Africa’s or Asia’s should be in caps, like AUTOCRACY).
Autocratic or totalitarian governments are more likely to go to war than democratic governments (I’m not sure about the numbers, but I think they are three to four times more likely), and war is the worst enemy of life expectancy (you can start to see where I’m going with this). If you plotted wars against life expectancy, I would expect a negatively correlated trend, with a trend line like this: “\”.
Even when it’s not war but, say, a pandemic or an epidemic, you can’t trust an autocratic or totalitarian government to get it right. During an epidemic, you can expect an autocratic government in Africa to prioritize its own tribesmen over other tribes, which is another dent in the opposition tribes’ life expectancy.
Again, if you grouped countries by type of government, I would expect a much smoother trend for democracies and a wiggly one for autocratic and totalitarian countries. You should have an idea of what will happen if you split countries into capitalist vs. communist and check the life expectancy trend. (If you have no idea, just read the stories of East and West Germany. Here’s one I found interesting: https://www.pewresearch.org/fact-tank/2019/11/06/east-germany-has-narrowed-economic-gap-with-west-germany-since-fall-of-communism-but-still-lags/)
Another thing you might notice is that no country in Africa has a life expectancy of 80 years or more. Asia started out worse than Africa in the 60’s, but by 2007 it has some countries in the 80+ years segment. To be fair, we can counter this by saying that most countries in the 80+ bracket had a life expectancy of 55-60 years in the 60’s, and none of those countries were in Africa. (To confirm this, we could check whether any country that had a life expectancy similar to African countries in the 60’s is now in the 80+ segment. We won’t do that for now; I just wanted to mention a way to go about it.)
Next, we might want to see the distribution of our data to answer questions like “what’s the typical life expectancy in Africa?” or “how does Africa compare with other continents?”. These are legit questions, and we can answer them with a box plot.
gapminder %>%
  filter(year == 2007) %>%
  # order the continents by their median life expectancy
  ggplot(aes(reorder(continent, lifeExp, FUN = median), lifeExp)) +
  geom_boxplot() +
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(hjust = .5, lineheight = .20, size = 15),
    plot.subtitle = element_text(hjust = .5, lineheight = .20, size = 13),
    axis.line = element_line(colour = "black")
  ) +
  labs(
    title = "Only Africa has a median life expectancy below 70 years",
    subtitle = "The height of each box shows the spread of life expectancy within a continent, not the number of countries",
    x = NULL, y = "Life Expectancy (years)"
  )
In a box plot, you have a box with two vertical lines (called whiskers) attached to its top and bottom. The horizontal line inside the box is the median: 50% of the data lies below it. The top of the box is the 75th percentile (75% of the data lies below it) and the bottom of the box is the 25th percentile (25% of the data lies below it). The whiskers extend to the farthest points that are not outliers, and the dots represent outliers.
Using Africa as an illustration:
We can see the life expectancy distribution across all the continents with a box plot. The dots you see on the plot are outliers. For example, in Asia there is a country with a life expectancy below 45 years. That country is Afghanistan, which is not surprising given the war between the Taliban and the US/NATO-led forces. In the Americas, the outlier is Haiti (it’s hard to pinpoint the exact reason, but a quick Google search might help, and you will probably find a likely cause such as political instability, natural disaster or famine).
Overall, Africa has a median life expectancy below 60 years. The country with the highest life expectancy is in Asia, although Asia’s median is the second lowest (the straight line inside each box is the median). Oceania has the best median, which is not surprising (we have fewer than 5 countries there).
To understand how a box plot works, or to refresh your memory, here is a short video on box plots that I found really good: https://www.youtube.com/watch?v=fJZv9YeQ-qQ
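The pieces of a box plot can also be computed directly. Below is a minimal base R sketch on a made-up vector of life expectancies (the values and the name `life_exp` are hypothetical, not taken from gapminder). Note that `boxplot.stats()` uses hinges, which can differ slightly from the quartiles returned by `quantile()`.

```r
# Hypothetical life expectancies (years) for ten countries, not real data
life_exp <- c(42, 48, 50, 52, 53, 55, 57, 59, 61, 76)

# Quartiles: 25%, 50% (the median) and 75% of the data lie below these values
q <- quantile(life_exp, probs = c(0.25, 0.5, 0.75))

# The five numbers a box plot draws (lower whisker, lower hinge, median,
# upper hinge, upper whisker) plus any points flagged as outliers
bs <- boxplot.stats(life_exp)

print(q)
print(bs$stats)
print(bs$out)
```

Points further than 1.5 times the box height from the box end up in `bs$out`; those are the dots on the plot.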
To dive deeper into life expectancy, we might want a view of the countries with the highest life expectancy. Below is a plot of the top 30 countries by life expectancy in 2007.
gapminder %>%
  filter(year == 2007) %>%
  top_n(30, lifeExp) %>%
  ggplot(aes(lifeExp, reorder(country, lifeExp))) +
  geom_point() +
  theme(
    panel.grid.major = element_line(linetype = "dotted", color = "black"),
    panel.grid.minor = element_line(linetype = "dotted", color = "black"),
    axis.text.y = element_text(size = 13),
    axis.text.x = element_text(size = 13),
    plot.title = element_text(hjust = .5, lineheight = .20, size = 15),
    plot.subtitle = element_text(hjust = .5, lineheight = .20, size = 13)
  ) +
  labs(
    y = NULL, x = "Life Expectancy (years)",
    title = "Japan has the highest life expectancy in the world (increased by almost 20 years in 55 years)",
    subtitle = "Top 30 countries based on life expectancy in 2007."
  )
A single generation in Japan (people born before the 50’s) has seen the country make significant progress in life expectancy. A child born in 1952 had a life expectancy of about 63 years, while a grandchild born in 2007 would have a life expectancy of about 82 years (19 years more). I wonder if Japan’s diet is another reason for the rapid increase in life expectancy (also Hong Kong).
In fact, considering that East Asian countries had to recover from war in the late 40’s (in a way that Europe didn’t have to), their growth in all aspects has been much more impressive than that of other regions (by East Asia I mean China, Taiwan, Hong Kong, Japan and Korea). Two of these countries are the top 2 in life expectancy and 4 of them are in the top 30.
Also look out for the Mediterranean countries (their diet might also be a plus).
Population might also be a factor: the bigger and more spread out the population, the harder it may be to reach a high national life expectancy and to compete with small countries. We saw something similar with Oceania in the previous plot. Because it has fewer than five countries in the data, you can understand why it would have the highest median of all the continents.
Now this is a good start.
Next, let’s plot gdpPercap and LifeExpectancy. What’s the relationship ?
library(ggrepel)   # provides geom_label_repel()
# find outliers: country-years with a gdpPercap of at least 60,000
Outliers <- gapminder %>%
  filter(gdpPercap >= 60000) %>%
  mutate(concat = str_c(country, year, sep = "_"))
# Kuwait in 2007 (where was Kuwait in 2007?)
Kuwait <- gapminder %>%
  filter(country == "Kuwait" & year == 2007) %>%
  mutate(concat = str_c(country, year, sep = "_"))
gapminder %>%
  ggplot(aes(gdpPercap, lifeExp)) +
  # a constant color goes outside aes(); inside aes() it is treated as data
  geom_point(color = "firebrick", size = 1) +
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(
    title = "Correlation between Life Exp & gdp is not obvious except for countries with more than 60 years LifeExp.",
    subtitle = "Kuwait, backed by huge oil reserves, enjoyed enormous wealth and prosperity between the 50's and 70's and even till today.",
    y = "LifeExp (years)"
  ) +
  geom_point(size = 2, color = "red", data = Outliers) +
  geom_label_repel(aes(label = concat), data = Outliers) +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = .5, lineheight = .20),
    plot.subtitle = element_text(hjust = .5, lineheight = .20),
    panel.grid.major = element_line(linetype = "dotted", color = "black"),
    panel.grid.minor = element_line(linetype = "dotted", color = "black"),
    axis.line = element_line(colour = "black"),
    axis.text.y = element_text(size = 13),
    axis.text.x = element_text(size = 13)
  ) +
  geom_label_repel(aes(label = concat), data = Kuwait) +
  geom_point(size = 2, color = "red", data = Kuwait)
Looking at this plot, it’s not easy to explain what’s really happening. You can see a positive correlation from about 60 years of life expectancy upward, and some outliers on the far right of the graph. The countries below 60 years of life expectancy that break out from the pile are most likely Arab countries (oil prosperity). Keep in mind that each country appears several times on the plot: it’s not 1 point per country, it’s 12 points per country (1952 to 2007 at five-year intervals).
A better way to understand the relationship between lifeExpectancy and gdpPercap would be to log-transform gdpPercap.
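To see why a log scale helps here, consider a minimal base R sketch with made-up incomes that differ by a factor of ten (the values and the name `gdp` are hypothetical). On the raw scale the gaps explode; on the log10 scale each tenfold step is the same distance, so poor and rich countries both get room on the axis.

```r
# Hypothetical per-capita incomes, each ten times the previous one
gdp <- c(500, 5000, 50000)

raw_gaps <- diff(gdp)          # distance between neighbours on the raw scale
log_gaps <- diff(log10(gdp))   # distance between neighbours on the log10 scale

print(raw_gaps)
print(log_gaps)
```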
# pick a few countries to highlight, at the start (1952) and end (2007) of the data
best_per_conti <- gapminder %>%
  mutate(concat = str_c(country, year, sep = "_")) %>%
  filter(
    country %in% c("Nigeria", "China", "United States", "Japan", "France", "Taiwan", "Brazil", "Equatorial Guinea"),
    year %in% c(1952, 2007)
  )
gapminder %>%
  ggplot(aes(gdpPercap, lifeExp)) +
  annotate("rect", xmin = 32000, xmax = 120050, ymin = 55, ymax = 70, color = "blue", alpha = 0.1) +
  # a constant color goes outside aes(); inside aes() it is treated as data
  geom_point(color = "firebrick", alpha = 0.3) +
  scale_x_log10(labels = NULL) +
  geom_point(size = 2, color = "red", data = best_per_conti) +
  geom_label_repel(aes(label = concat), data = best_per_conti) +
  theme(
    legend.position = "none",
    axis.line = element_line(colour = "black"),
    axis.ticks = element_blank(),
    panel.grid.major = element_line(linetype = "dotted", color = "black"),
    panel.grid.minor = element_line(linetype = "dotted", color = "black"),
    plot.title = element_text(hjust = .5, lineheight = .20),
    plot.subtitle = element_text(hjust = .5, lineheight = .20),
    axis.text.x = element_text(size = 13),
    axis.text.y = element_text(size = 13)
  ) +
  labs(
    y = "LifeExp (years)",
    title = "Lower left quadrant occupied by African countries. Upper right occupied by Europeans, USA. Asia is catching up",
    subtitle = "Asian countries took their growth personally. Arabian countries made huge gains from their oil reserves."
  ) +
  annotate("segment", x = 44000, xend = 44000, y = 51, yend = 54.5, color = "black", size = 1.5, arrow = arrow()) +
  annotate("text", x = 55000, y = 49, label = "Oil rich countries (Kuwait and \n most likely Saudi Arabia)", color = "black", size = 4) +
  # the dashed lines sit at the 2007 means of lifeExp and gdpPercap, so they are labelled as means
  annotate("text", x = 500, y = 68, label = "Mean (2007)", size = 5) +
  annotate("text", x = 16000, y = 22, label = "Mean (2007)", size = 5) +
  geom_hline(yintercept = 67.00742, linetype = "dashed", size = 1) +
  geom_vline(xintercept = 11680.07, linetype = "dashed", size = 1)
Now the relationship is obvious: life expectancy rises roughly linearly with the log of gdpPercap. We can see China’s huge growth between 1952 and 2007. No continent took its growth more personally than Asia (see Japan, China and Taiwan: East Asia again). Asian countries’ economic reforms of the 60’s and 70’s are paying off. There is also tremendous growth by Equatorial Guinea (its gdpPercap increased from $2,814 in 1997 to $12,154 in just 10 years). I should also mention the curse of this growth: Equatorial Guinea reportedly has one of the highest Gini indexes in the world (the Gini index measures how unequally income or wealth is distributed).
Countries that combine a middling life expectancy with a big gdpPercap are oil-rich countries (we saw this with Kuwait; Saudi Arabia should be very close). The main takeaway from this plot: rich countries are healthier and poor countries have low life expectancy.
Let’s not forget that this data ends in 2007. If we plotted the latest data now, China would likely be in the top right quadrant of our graph.
We have so many points on the plot that it wouldn’t make sense to highlight every country. What we might want to do instead is group countries into continents and see how they compare with each other.
There are some basic facts we already know. I don’t have to convince you that Africa is not as prosperous as Europe; you obviously knew this 5 or 10 years ago. What you might not know is how rich European countries are, how poor countries in parts of Asia, Africa or the Americas are, or how far Asian countries have come from 1952 to 2007. Knowing any of this about continents can give you a hint of the growth trajectory of most countries (countries sit in continents, so if you know where a continent is heading, you have a good idea where the countries in it are going).
We can start by looking at each continent and how far it has come from 1952 to 2007 in terms of life expectancy and gdpPercap.
gapminder %>%
ggplot(aes(gdpPercap, lifeExp, color = continent)) +
geom_point() +
scale_x_log10(labels = NULL) +
facet_wrap(~year, nrow =2) +
theme(axis.ticks = element_blank(),
axis.text.y = element_text(size = 12))
The time period is split across 2 rows, which makes it a little hard to follow what’s going on. But we can still notice the disparity between European and African countries.
Let’s make the plot easier to read by putting all the panels in a single row.
gapminder %>%
ggplot(aes(gdpPercap,lifeExp, color = continent)) + geom_point() + scale_x_log10(labels = NULL) + facet_wrap(~year, nrow = 1) +
theme(axis.ticks = element_blank())
Now we can see the trend in life expectancy, and it’s a linear one (in short, life expectancy rises with every five-year step). In 1952, the poorest countries occupied the 30-40 years band of life expectancy; by 2007 that band is up to 40-50 years (a 10-year increase).
A 10-year increase in life expectancy in 55 years is huge, especially for the African countries that occupied that band. You have to consider the number of famines, epidemics, mass killings, genocides, political killings and wars fought in Africa in that period to understand why. To still come out on the other side with a 10+ year increase is enormous progress, which makes you wonder “what if”.
If Africa’s growth in life expectancy was in kilometers, Asia’s would be in miles. In 1952, Asia and Africa were both in the lower quadrant; 55 years later, Asian countries are almost converging at the top. In 1952, you could group countries into Africa, Asia and Others. By 2007 that grouping has changed to “Africa and Others”, which says a lot about how much catching up we have to do on this side (especially in Sub-Saharan Africa).
With Europe and the USA, it’s the same scenario: they always lead the pack. In the 50’s most European countries occupied the upper band (60-70 years); by 2007 that is up to 75-85 years.
I will digress a bit here (but still on Europe and the USA). One of the arguments David Servan-Schreiber made in his book Anti-Cancer is that the Western diet is having an enormous negative impact on our well-being and health (the sugar effect). He also mentioned that Mediterranean and Asian diets (especially Japan’s) are far better, and that as a result people in those regions live longer. In 1952, Japan was far behind the USA in life expectancy. 55 years later, not only has it overtaken the USA and most of western Europe, it is into the 80+ years band (the US has been stuck between 77 and 79 years for more than 10 years). This makes me wonder whether the stagnant (or very slow) growth in life expectancy in Europe and the USA can be blamed on diet (and in Africa too, since the Western diet is a big part of our diet now), and I strongly suspect it can. The argument is harder to win because, unlike war, where the effect is immediate and dramatic, diet has a slow and gradual effect that is not easy to pinpoint. The case on diet will be for another day.
We can flip our axes by putting gdpPercap on the y-axis to see the changes in gdpPercap for the continents.
gapminder %>%
  ggplot(aes(lifeExp, gdpPercap, color = continent)) +
  geom_point() +
  scale_y_log10(labels = NULL) +
  facet_wrap(~year, nrow = 1) +
  theme(axis.ticks = element_blank())
Progress in gdpPercap is not as pronounced as in life expectancy, but we can still see it. Because I log-transformed gdpPercap and dropped the axis labels, we can’t read off specific figures, but we can still make sense of the trend. We can see the pattern mentioned earlier: Asian countries growing so fast that they are now in the middle of the range, and Africa, as usual, trailing the pack.
As for Europe and the USA, the gradual growth is obvious. What is less obvious is the movement of countries in the Americas. They are more spread out in 2007 than, say, before the 80’s (meaning a gap is opening up between the economies in that continent). Brazil is making big gains in gdpPercap (as we saw in the previous plot).
Before I end this section, let’s see how far countries have come in terms of gdpPercap since 1952. Which countries have gained the most ground?
step_1 <- gapminder %>% filter(year %in% c(1952, 1997, 2002, 2007))
step_1lf <- step_1 %>% select(country, continent, year, lifeExp)
step_1p <- step_1 %>% select(country, continent, year, pop)
step_1g <- step_1 %>% select(country, continent, year, gdpPercap)
# one row per country, one gdpPercap column per year
step_p <- step_1g %>% pivot_wider(names_from = year, values_from = gdpPercap)
# Prop is the proportional increase from 1952 to 2007 (1 = 100%); keep the top 30
step_1 <- step_p %>%
  mutate(Prop = (`2007` - `1952`) / `1952`) %>%
  top_n(30, Prop)
ggplot(step_1, aes(reorder(country, Prop), y = Prop, fill = continent)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(
    panel.grid.major.x = element_line(linetype = "dotted", color = "black"),
    panel.grid.minor = element_line(linetype = "dotted", color = "black"),
    panel.grid.major.y = element_blank(),
    legend.position = c(1, 0),
    legend.justification = c(3.3, -4),
    legend.background = element_rect(fill = "grey92", color = "grey90"),
    legend.text = element_text(size = 12), legend.title = element_text(size = 12),
    plot.title = element_text(hjust = .5, lineheight = .20, size = 15),
    plot.subtitle = element_text(hjust = .5, lineheight = .20, size = 12),
    axis.text.y = element_text(size = 13), axis.text.x = element_text(size = 13)
  ) +
  ylab("percent increase \n gdpPercap growth, 1952 to 2007") +
  xlab(NULL) +
  labs(
    title = "Equatorial Guinea has grown exponentially since 1952. They also have the highest gini index in the world.",
    subtitle = "Asian countries make up the top pile. A few countries in Europe also have a thing for growth"
  ) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 33), breaks = c(10, 20, 30), labels = scales::percent_format())
step_2 <- gapminder %>% filter(year == 1987 | year == 1992 | year == 1997 | year == 2002 | year == 2007) %>%
select(country,continent, year, gdpPercap)
Equatorial Guinea’s gdpPercap has grown by more than 3,000%. Taiwan is in second place with more than 2,000% growth. Remember that thing East Asia had with growth in the life expectancy section? The same pattern shows up in the plot above (we’ll come back to this).
I didn’t expect to see the superpowers here (USA, UK, France or the other Allied powers from WWII), but I was on the lookout for countries that lost out in World War II, because their gdpPercap would have been very low and their economies were more likely to be on a recovery path. That is why I’m not surprised to see Italy, Japan and China (routed by Japan in WWII) in there, and even Spain (civil war).
What you will mostly find in a list like this is countries that started with a low gdpPercap. It’s easier to have a 500% or 1,000% growth rate if your gdpPercap starts low (say less than a thousand) than if it starts high (say 10,000).
If you make the same plot 50 years from now (say from 2000 to 2050), you are more likely to see countries like Nigeria, Chad and Sudan top this kind of list.
So I don’t want you to see this and think these countries performed better than the big economies over the period.
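The base effect described above is easy to check with a little arithmetic. Here is a minimal base R sketch; the starting values, the $5,000 gain and the helper name `pct_growth` are all made up for illustration.

```r
# Percentage growth from a starting value to an ending value
pct_growth <- function(start, end) (end - start) / start * 100

gain <- 5000         # the same absolute gain for both hypothetical countries
low_base <- 1000     # hypothetical poor country
high_base <- 10000   # hypothetical rich country

low_growth <- pct_growth(low_base, low_base + gain)
high_growth <- pct_growth(high_base, high_base + gain)

print(low_growth)
print(high_growth)
```

The same $5,000 improvement shows up as a tenfold difference in growth rate, which is why a ranking by percentage growth favours countries that started poor.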
One last thing I was interested in was the growth trend of these countries between 1957 and 2007. I was on the lookout for the moment their growth rate went off the chart, a regression-towards-the-mean kind of situation: when a growth rate is extreme (very high), you know that at some point it will move back closer to the mean (back to normal). Let’s do this for the top 5 countries.
# pull out the top countries and compute their growth rate over time
# lag() returns the previous observation, so each rate covers the preceding five years
Top_3 <- gapminder %>%
  filter(country %in% c("Korea, Rep.", "Taiwan", "Equatorial Guinea", "Singapore", "Botswana")) %>%
  select(country, year, gdpPercap) %>%
  group_by(country) %>%
  mutate(
    diff_year = year - lag(year),
    diff_growth = gdpPercap - lag(gdpPercap),
    gdp_percentage_increase = (diff_growth / lag(gdpPercap)) * 100
  ) %>%
  filter(year %in% c(1957, 1967, 1977, 1987, 1997, 2007))
ggplot(Top_3, aes(year, gdp_percentage_increase, group = country)) +
  geom_line(aes(color = country), size = 1) +
  geom_point(aes(color = country), size = 3) +
  scale_x_continuous(
    position = "top",
    breaks = c(1957, 1967, 1977, 1987, 1997, 2007),
    limits = c(1957, 2015)
  ) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  # set the axis title here: the first argument of element_text() is the font family, not the label
  labs(y = "gdpPercap percentage increase") +
  theme(
    legend.position = "none",
    panel.border = element_blank(),
    axis.line.y = element_line(color = "black"),
    axis.title.x = element_blank(),
    axis.text.y = element_text(size = 12)
  ) +
  geom_text(
    data = Top_3 %>% filter(year == 2007),
    aes(label = country),
    hjust = -.1, fontface = "bold", size = 4
  )
Equatorial Guinea’s gdpPercap grew by more than 120% in the five years to 1997 and by another 50% or so in the five years to 2007 (each point on the plot is the change since the previous five-yearly observation, shown every ten years). A jump that extreme is exactly where you would expect regression to the mean to kick in. Guinea’s growth will make you think the growth in the other four countries is normal, but it is not. These are very impressive numbers. Taiwan’s plotted rates average about 25% until 1987-1997. Korea’s are above 50% between 1967 and 1997, and it roughly maintained that pace into the next decade.
If you look at the growth plot again, you should notice that the top 10 countries (by gdpPercap % increase) are dominated by Asian countries. Your next question might be “what do these Asian countries have in common?”. Oman is certainly in that group because of its oil reserves, and it is in the Middle East, which makes it an outlier in the group. Thailand and Singapore are both in South-East Asia. What I’m driving at is that you will have noticed Japan, China, Taiwan, Korea Rep. and Hong Kong China in that top cluster. These countries are neighbours.
For this reason, we might want to look at the countries of East Asia and check their growth trend from 1957 to 2007. Below is a map and also a facet plot of East Asian countries and their gdpPercap growth trend.
East_Asia <- gapminder %>%
filter(country %in% c("China", "Japan", "Taiwan","Korea, Dem. Rep.","Korea, Rep.","Mongolia","Hong Kong, China")) %>%
select(country, year, gdpPercap) %>%
group_by(country) %>%
mutate(
diff_year = year - lag(year),
diff_growth = gdpPercap - lag(gdpPercap),
rate_percent = (diff_growth/lag(gdpPercap))*100)
ggplot(East_Asia, aes(year, rate_percent, group = country)) +
geom_line() +
geom_point(colour = "red", size = 2) +
facet_wrap(~ country)
Can you guess from the plots which country is North Korea and which is South Korea?
China’s numbers are consistent and impressive. Hong Kong China and Taiwan are actually the most consistent in terms of growth (never below 0%). Japan’s numbers might come across as merely decent, but they are actually a good return.
Korea, Rep. is South Korea and Korea, Dem. Rep. is North Korea. North Korea’s numbers were pretty impressive in the 70’s and 80’s, but due to economic sanctions and isolation its economy suffered in the 90’s. South Korea, on the other hand, is reaping the benefits of trade and openness.
Mongolia’s numbers are not overly bad, but they don’t follow the same trend as the other, non-isolated Asian countries. They are, however, on a positive trend.
The thing is, you might read this differently depending on your domain knowledge.
Again, I don’t want you to make the mistake of thinking a drop in growth is the same as a drop in gdpPercap. It isn’t. A drop in growth simply means the rate of increase has slowed. If I had 10 cars today and added another 10, that’s a 100% growth rate. Now that I have 20 cars, if I add another 10 the rate is only 50%, even though I added the same number of cars. This is what makes East Asia impressive: these countries kept growing at almost the same proportion, not year on year but decade on decade. Countries struggle to achieve a growth rate of 5%, and politicians boast about growing the economy by as little as 3-4%.
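The car example can be written out in a couple of lines of base R (the vector name `cars_owned` is mine). The stock keeps rising by the same amount while the growth rate falls, which is the distinction between a drop in growth and a drop in the level itself.

```r
# Number of cars owned at three points in time (the example from the text)
cars_owned <- c(10, 20, 30)

added <- diff(cars_owned)                      # cars added in each step
rate <- added / head(cars_owned, -1) * 100     # growth rate relative to the previous stock

print(added)
print(rate)
```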
So far, we have been talking at the level of continents: how Europe is leading the pack, Asia’s significant growth, and how African countries are chasing. We have rarely dived into individual countries. We do that now.
The first thing we are going to do is divide all countries into two batches. We have too many countries to deal with at once, and splitting them makes it easier to put things in perspective and to produce readable visuals (instead of jam-packed ones). Even with this division, some of our visuals still suffer from too many data points, but we are better off.
The 2 batches are going to be small and big countries.
To understand the relationships among small countries, we are going to start with a PCA (principal component analysis) plot. PCA lets us compute principal components, and we then use these components to visualize and understand our data.
Remember that when we wanted to find the relationship between life expectancy and year, we drew a scatterplot. Then, to find the relationship between life expectancy and gdpPercap, we drew another scatterplot. That’s two scatterplots. Now imagine we have 10 variables and want to find the relationships among them: we would end up with a lot of scatterplots.
To visualize 10 variables pairwise, you have to draw 45 scatterplots, from p(p-1)/2 where p is the number of variables. That’s simply cumbersome and nobody wants to do that. This is where PCA steps in.
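The count of pairwise scatterplots can be checked in base R. This small sketch evaluates the p(p-1)/2 formula for p = 10 and compares it with the built-in `choose()` function (the variable names are my own).

```r
p <- 10                          # number of variables
n_pairs <- p * (p - 1) / 2       # pairwise scatterplots needed, from the formula
n_pairs_check <- choose(p, 2)    # the same count as "10 choose 2"

print(n_pairs)
print(n_pairs_check)
```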
PCA allows us to find the relationship between variables with just a single plot. It’s a powerful tool and an unsupervised approach in machine learning. I won’t go into the tiny details of PCA, however i will explain it on a basic level.
We start by doing a deep-dive into small countries with PCA. You should have noticed by now that our variables (lifeExpectancy, population and gdpPercap) are on very different scales. PCA centers these variables to have a mean of zero, and we also scale them to have a standard deviation of one.
This is what i’m driving at (check below). If we don’t scale our variables, our principal components will be driven mostly by population, since it has by far the biggest variance and mean.
for_scaling <- gapminder %>% filter(year == 2007)
apply(for_scaling[,-(1:2)], 2, mean) #checking means of our variables
## year lifeExp pop gdpPercap
## 2.007000e+03 6.700742e+01 4.402122e+07 1.168007e+04
apply(for_scaling[,-(1:2)], 2, var) #checking the variance of our variables
## year lifeExp pop gdpPercap
## 0.000000e+00 1.457578e+02 2.179208e+16 1.653780e+08
#break countries into 2 batches (small [population<10M] and big countries[population >= 10M]);
#For small countries
#step 1: for small fit the PCA and clustering algorithm;
#step 2: Plot
# filter for countries with a population below 10M (small countries)
gp_small_countries <- gapminder %>% filter(year == 2007 & pop < 10000000) %>%
select(c(country, lifeExp, pop, gdpPercap)) %>%
as.data.frame() %>% set_rownames(.[,1]) %>%
.[,-1]
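#perform pca on small countries (scale. = TRUE gives each variable unit variance)
gp_s_pca <- prcomp(gp_small_countries, scale. = TRUE)
#flip the sign of the loadings and scores (the sign of a principal component is arbitrary)
gp_s_pca$rotation <- -gp_s_pca$rotation
gp_s_pca$x <- -gp_s_pca$x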
#plot pca
biplot(gp_s_pca,scale = 0, arrow.len = 0.1, expand = 1, cex = c(1,1.4))
How do we read the plot?
Above is a PCA plot of all small countries. From the plot we can see that the lifeExp and gdpPercap arrows are close to each other (both load mostly on the first principal component) and the pop arrow is far from both (it loads mostly on the second principal component). This means that lifeExp and gdpPercap are correlated with each other and both are much less correlated with pop. Let’s confirm this:
cor(gp_small_countries)
## lifeExp pop gdpPercap
## lifeExp 1.00000000 0.0713115297 0.6604376893
## pop 0.07131153 1.0000000000 0.0007362266
## gdpPercap 0.66043769 0.0007362266 1.0000000000
Confirmed. In a PCA biplot, broadly speaking, the farther apart the arrows of two variables are, the less correlated they are with each other. In short, countries with high gdpPercap tend to have high lifeExpectancy (and vice-versa), and population has little say in a country’s gdpPercap or lifeExpectancy. Hence, countries (like Israel, Denmark, Kuwait, Sweden, Ireland) on the right hand side of our plot have high gdpPercap and lifeExpectancy, and countries (like Liberia, Lesotho, Congo, Rep., Mongolia, Burundi) on the left hand side have low gdpPercap and lifeExpectancy. Also, countries like Iceland, Bahrain, Slovenia, Gabon, Comoros (on the lower part of the plot) have very low population, whereas countries like Hungary, Benin, Somalia, Bolivia (on the upper part of our plot) have bigger populations.
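If you want to see the numbers behind the arrows, the loadings and the share of variance per component can be printed from the fitted object. The sketch below uses simulated data (not gapminder) so that it runs on its own; the data frame `toy` and its values are assumptions made up for illustration, built so that lifeExp and gdpPercap move together while pop is independent.

```r
set.seed(1)
n <- 200

# Simulated (made-up) data: gdpPercap is tied to lifeExp, pop is unrelated.
life <- rnorm(n, mean = 65, sd = 10)
gdp  <- 300 * life + rnorm(n, sd = 1500)
pop  <- runif(n, min = 1e5, max = 1e7)
toy  <- data.frame(lifeExp = life, pop = pop, gdpPercap = gdp)

# Same call pattern as in the post: centre and scale, then compute components.
toy_pca <- prcomp(toy, scale. = TRUE)

# Loadings: one column per principal component, one row per variable.
toy_pca$rotation

# Proportion of the total variance captured by each component.
var_share <- summary(toy_pca)$importance["Proportion of Variance", ]
var_share
```

On the real data the equivalent calls would be `gp_s_pca$rotation` and `summary(gp_s_pca)`. Variables whose loadings on the first component have the same sign and similar size are the ones whose arrows point in a similar direction in the biplot.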
How can we know the countries with the highest gdpPercap from our plot? The farther a country is to the east (hugging the right border), the higher the gdpPercap of that country (in our plot, that’s Singapore and Norway). You can use that same logic for the poorest countries on the far left (you should know what’s happening in Sierra Leone or Liberia). Apply this same logic to population (but make it north and south).
Overall, as expected, the right side is dominated by European countries and the left side is dominated by African countries.
Notice Libya in the middle of our plot (this was 2007). If we did an updated PCA for 2020, i would expect Libya to be far away on the left (due to the civil war that followed the Arab Spring).
Let’s check out the Big guns.
# Repeat the same process for Big countries
gp_big_countries <- gapminder %>% filter(year == 2007 & pop >= 10000000) %>%
select(c(country, lifeExp, pop, gdpPercap)) %>%
as.data.frame() %>% set_rownames(.[,1]) %>%
.[,-1]
gp_b_pca <- prcomp(gp_big_countries, scale. = TRUE)
biplot(gp_b_pca,scale = 0, arrow.len = 0.1, expand = 0.9, cex = c(1,1.4))
For the big guns, it’s harder to read. The big countries have more in common with each other than the small countries do. Despite this, we can see a trend and see some countries pulling away. As always, it’s not hard to tell that China and India are the most populated countries in the pack. United States, Japan, Germany, Canada, Netherlands have high gdpPercap and lifeExpectancy. Nigeria, Ethiopia, Pakistan have a lot of catching up to do. Brazil and Mexico have above-average levels of lifeExpectancy and gdpPercap.
To compensate for the not so intuitive PCA plot for the big countries, i will supplement it with a corrplot of the big countries.
corrplot(cor(t(scale(gp_big_countries))),order = "hclust",tl.cex = 0.9)
A word of caution on the corrplot. I’m using a special type of corrplot that correlates countries with each other rather than variables. Each country is described by only three scaled values here, so it is not as accurate as PCA or hierarchical clustering, though it is still useful. Or maybe i have not figured out how to make it as accurate as PCA (most likely this).
This should not be too hard to figure out, but be careful when reading it. Don’t read this and say United States and France or Germany are dissimilar. Countries that are grouped together are more correlated with each other.
The first column of the grid shows the correlation between Brazil and all other countries; the last column shows the correlation of all countries with Venezuela. Countries that are positively correlated have a blue cell (the darker the better); if red, they are negatively correlated (the darker, the farther apart they are).
I will say there are five groups:
Group 1 starts with Brazil and ends with Pakistan. These countries are correlated and have much more in common with each other than with other countries or groups. You could say Brazil and Yemen should have nothing in common, and that might be true, but it is not the whole truth. Remember this is 2007 data; Yemen was a different Yemen then (improving year on year and way better off than it is today).
Group 2 starts with Angola and ends with Mali. This group is dominated by African countries. Do a plot with current data, and China or India wouldn’t be in this group. You get the feeling that what these countries have in common is their huge population size;
Group 3 are the pace-setters, starting with USA and ending with Portugal; this group is driven more by gdpPercap. This is the only group with countries that i think haven’t been to war in the last 70 years (counting from 2021) or had wars within their borders. We do have USA, who is always a major player in most international conflicts, but these conflicts are usually far away from their borders (internationalized intrastate war and interstate war). These countries are the major reason why the second half of the 20th century is called the Long Peace. You don’t want any of these countries going to war; it would be way more lethal than with any other group. You also have to consider that China would be in this group now. Maybe you haven’t noticed yet, but we don’t have Russia in this dataset; i would also expect them to be in this group.
Group 4 starts with the Philippines and ends with Serbia;
Group 5 starts with Poland and ends with Venezuela.
You don’t want a war, either interstate (between countries) or intrastate (civil war), involving the Group 3 countries. It would be lethal and most likely a nuclear war. The last war between these countries was World War II (USA detonated two nuclear bombs in Japan, 3 days apart). It took Japan more than 40 years to recover from that. A nuclear bomb today could kill twice as many people, because population density has increased significantly compared to what it was 80 years ago.
Instead of PCA, another unsupervised approach we can use to understand our data is hierarchical clustering. With hierarchical clustering you get a tree representation called a dendrogram. For hierarchical clustering we use a dissimilarity measure (how different two observations are) between each pair of observations. i will be using Euclidean distance as the dissimilarity.
Hierarchical clustering is a bottom-up approach, meaning each observation starts as its own cluster. So if you have 100 observations, you start with 100 clusters. At each step, the two clusters that are most similar are fused together, so that we now have 99 clusters (100-1), then 98, and so on until only one cluster is left.
The typical scenario is like this: Japan is in cluster 1, Sweden is in cluster 2. Ah, they are the most similar pair, so fuse them together into one cluster (Japan and Sweden). Then look for the next most similar pair of clusters, which may be another country joining Japan and Sweden or two other countries pairing up, and fuse those too (you should get the drill by now). That’s how hierarchical clustering works on a basic level. As always, we scale our data so that our clustering is not driven more by a particular feature.
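To make the fusing steps concrete, here is a minimal sketch on five made-up observations. The data frame `toy`, its row names A to E and its values are assumptions for illustration only; the calls (`scale`, `dist`, `hclust`, `cutree`) are the same ones used on the gapminder data.

```r
# Five hypothetical "countries" with made-up values (not real data).
toy <- data.frame(
  lifeExp   = c(82, 81, 50, 48, 70),
  gdpPercap = c(32000, 34000, 900, 700, 9000),
  row.names = c("A", "B", "C", "D", "E")
)

# Scale first, then compute Euclidean distances between every pair of rows.
toy_dist <- dist(scale(toy))

# Bottom-up clustering with complete linkage.
toy_hc <- hclust(toy_dist, method = "complete")

# One row per fusion: 5 observations need 5 - 1 = 4 fusions to reach one cluster.
toy_hc$merge

# Dissimilarity at which each fusion happened (never decreases for complete linkage).
toy_hc$height

# Cut the tree into 2 groups and see who ends up together.
groups <- cutree(toy_hc, k = 2)
groups
```

Negative numbers in `toy_hc$merge` refer to single observations and positive numbers to clusters formed in earlier rows, which is the "fuse the closest pair, then repeat" drill described above.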
We start with the small countries.
#perform clustering on small countries - Tip: scale data before plotting
gp_s_cluster <- hclust(dist(scale(gp_small_countries)),method = "complete")
plot(gp_s_cluster, cex = 1, col = "red", main = "European countries and few countries from Asia occupy the prosperous spot (first branch by the left). \n Jamaica and Lebanon share a lot in common, same with Panama and Uruguay.")
No need to dive into the details, by now you should have an idea how this plays out.
Then we check out the big guys.
# Repeat the same process for Big countries
gp_big_countries <- gapminder %>% filter(year == 2007 & pop >= 10000000) %>%
select(c(country, lifeExp, pop, gdpPercap)) %>%
as.data.frame() %>% set_rownames(.[,1]) %>%
.[,-1]
gp_b_cluster <- hclust(dist(scale(gp_big_countries)))
plot(gp_b_cluster, cex = 1, col = "red", main = "Disparity exists between countries in Africa (See Tunisia, Egypt, Algeria). \n African Countries not in Sub-sahara africa have more in common with developing nations than countries in Sub-sahara.")
Note the dissimilarity between Brazil and Yemen in our dendrogram (whereas our corrplot says they are similar; the Brazil and Yemen relationship in the corrplot is driven more by population and lifeExpectancy, unlike in hierarchical clustering). If you have to pick based on accuracy, go with hierarchical clustering.
The third branch of the tree is occupied by developing countries (Brazil, Indonesia and the rest).
Non sub-Saharan African countries like Morocco, Egypt, Tunisia are more developed than sub-Saharan countries. This shouldn’t come as a surprise, for several reasons. One is that most of these countries are close to European countries (sharing a border or just a boat ride away), and as we have seen, progress filters down from Europe: the closer you are to prosperity, the easier it is to get prosperous. Another is that these countries are more homogeneous than most countries in Africa. Take a country like Nigeria, where we have 371 tribes: the easiest way to gain power in a group is to amplify the dissimilarity between your tribe and others and stoke hatred. Over time you get an ineffective leader who can always go back to these tactics and use them as a smoke screen for his ineffectiveness (we’ve seen this happen again and again). In short, leaders are held more accountable in homogeneous countries.
I hope my points about the disparity between sub-Saharan and non sub-Saharan African countries don’t come across as an excuse, but i do believe all these things add up over a long period of time. Now that i have mentioned homogeneity as a better driver of progress, i should also mention the disadvantage of it. I believe (this is my belief) atrocities like genocide are more likely to happen in a homogeneous society than in a heterogeneous society. It is easy to convince a particular sect to kill the other sect (in a homogeneous society) if you know you are just one sect away from dominating the country. It’s a case of “you or me” (we saw this happen in Rwanda, Turkey, Nazi Germany). It is way harder for that to happen in a heterogeneous society (where you have say 10 or 20 tribes), because there it’s a case of “you or me or them”. You can kill one, then you have 19 coming for you, and that’s enough of a deterrent. We do have cases of genocide in heterogeneous societies, but they are way less lethal. The civil war in Nigeria would have been worse if the country had been divided along two ethnic lines, and we’d have seen a genocide of massive proportions.
One thing i learnt while reading Better Angels of our Nature and Guns of August is that progress in countries (and progress in general) is not that simple to explain. But we live in a world where we want quick answers, so we look for short-cut answers. My point about the disparity between sub-Saharan and non sub-Saharan countries might be a short-cut answer, but it’s a good place to start from, and the more you read or deep-dive on the topic, the more you can refine your opinion.
To end this section, i will do a cluster analysis of all countries together. Until now, we have been working with big countries and small countries separately. What would be nice is to have a cluster analysis of all countries together. Either k-means clustering or hierarchical clustering would work for that; i use hierarchical clustering here and cut the tree into 4 clusters.
gp_all_countries <- gapminder %>% filter(year == 2007) %>%
select(c(country, lifeExp, pop, gdpPercap)) %>%
as.data.frame() %>% set_rownames(.[,1]) %>%
.[,-1]
gp_all_countries_cluster <- hclust(dist(scale(gp_all_countries)))
cluster <- cutree(gp_all_countries_cluster,4)
#sort cluster
sort(cluster)
## Afghanistan Angola Benin
## 1 1 1
## Botswana Burkina Faso Burundi
## 1 1 1
## Cameroon Central African Republic Chad
## 1 1 1
## Congo, Dem. Rep. Congo, Rep. Cote d'Ivoire
## 1 1 1
## Djibouti Equatorial Guinea Ethiopia
## 1 1 1
## Gabon Guinea Guinea-Bissau
## 1 1 1
## Kenya Lesotho Liberia
## 1 1 1
## Malawi Mali Mozambique
## 1 1 1
## Namibia Niger Nigeria
## 1 1 1
## Rwanda Sierra Leone Somalia
## 1 1 1
## South Africa Swaziland Tanzania
## 1 1 1
## Uganda Zambia Zimbabwe
## 1 1 1
## Albania Algeria Argentina
## 2 2 2
## Bahrain Bangladesh Bolivia
## 2 2 2
## Bosnia and Herzegovina Brazil Bulgaria
## 2 2 2
## Cambodia Chile Colombia
## 2 2 2
## Comoros Costa Rica Croatia
## 2 2 2
## Cuba Czech Republic Dominican Republic
## 2 2 2
## Ecuador Egypt El Salvador
## 2 2 2
## Eritrea Gambia Ghana
## 2 2 2
## Greece Guatemala Haiti
## 2 2 2
## Honduras Hungary Indonesia
## 2 2 2
## Iran Iraq Israel
## 2 2 2
## Jamaica Jordan Korea, Dem. Rep.
## 2 2 2
## Korea, Rep. Lebanon Libya
## 2 2 2
## Madagascar Malaysia Mauritania
## 2 2 2
## Mauritius Mexico Mongolia
## 2 2 2
## Montenegro Morocco Myanmar
## 2 2 2
## Nepal New Zealand Nicaragua
## 2 2 2
## Oman Pakistan Panama
## 2 2 2
## Paraguay Peru Philippines
## 2 2 2
## Poland Portugal Puerto Rico
## 2 2 2
## Reunion Romania Sao Tome and Principe
## 2 2 2
## Saudi Arabia Senegal Serbia
## 2 2 2
## Slovak Republic Slovenia Sri Lanka
## 2 2 2
## Sudan Syria Taiwan
## 2 2 2
## Thailand Togo Trinidad and Tobago
## 2 2 2
## Tunisia Turkey Uruguay
## 2 2 2
## Venezuela Vietnam West Bank and Gaza
## 2 2 2
## Yemen, Rep. Australia Austria
## 2 3 3
## Belgium Canada Denmark
## 3 3 3
## Finland France Germany
## 3 3 3
## Hong Kong, China Iceland Ireland
## 3 3 3
## Italy Japan Kuwait
## 3 3 3
## Netherlands Norway Singapore
## 3 3 3
## Spain Sweden Switzerland
## 3 3 3
## United Kingdom United States China
## 3 3 4
## India
## 4
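The text mentions k-means as an alternative, but only the hierarchical version is shown. Below is a minimal sketch of what the k-means route could look like, run on simulated data so that it is self-contained; the data frame `toy` and its two well-separated groups are made up for illustration and are not gapminder values.

```r
set.seed(42)

# Made-up data: 20 "low" and 20 "high" observations, clearly separated.
toy <- data.frame(
  lifeExp   = c(rnorm(20, mean = 50, sd = 3),     rnorm(20, mean = 78, sd = 3)),
  gdpPercap = c(rnorm(20, mean = 1500, sd = 300), rnorm(20, mean = 30000, sd = 3000))
)

# Scale first, for the same reason as with PCA and hierarchical clustering.
toy_scaled <- scale(toy)

# k-means needs the number of clusters up front; nstart = 20 tries several
# random starting points and keeps the best solution.
toy_km <- kmeans(toy_scaled, centers = 2, nstart = 20)

# Size of each cluster.
table(toy_km$cluster)

# On the real data the analogous call would be (assumption, not run here):
# kmeans(scale(gp_all_countries), centers = 4, nstart = 20)
```

Unlike `cutree()`, which cuts an already built tree, `kmeans()` has to be told the number of clusters before it runs, and its cluster labels (1, 2, ...) are arbitrary and can change between runs.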
I was 80% done with Part I and 95% done with Part II when i came across another gapminder data set in R (from the dslabs package). It’s not the latest data (it runs up to 2016), but compared to ours it’s more recent. There was no need to start all over again, because it shows the same trend, and this new gapminder data has a lot of missing values (mostly gdp between 2013 and 2016). We do have Russia in the data set, the continents are further split into regions, and we have some new features like infant mortality and fertility.
I won’t go over the details, but i will add a plot i found interesting (though nothing unexpected).
gap_latest <- dslabs::gapminder
gap_latest %>% ggplot(aes(year, fertility)) +
geom_point(color = "red", alpha = 0.7, shape = 20, size = 1.7) + # fixed colour goes outside aes()
facet_wrap(~region)
## Warning: Removed 187 rows containing missing values (geom_point).
Fertility (the birth rate) is dropping year on year all over the world. As always, the drop started earlier in Europe (Western Europe to be specific).
Part I ends here, will post Part II in the next 10 hours. In Part II, we go over modeling and predicting future data with our current data-set.