Summary

I did this short analysis as a self-training exercise. The dataset comes from Gapminder, which sourced the data from the UN Population Division.

data source: https://www.gapminder.org/data/
data file: https://docs.google.com/spreadsheets/d/1aYvBICRlEsG0VGjJ8SPD2Ot5O2gBPdfeL0hJYQ1Na90/pub?gid=0#

The indicator represents the median age per country per year.

Dataset first overview

After loading the file and taking a quick look at the data, four small issues appear:

  • each column name (except the first one) starts with “X”

  • the last column, named “X”, is empty

  • the last row is empty

  • the columns hold values (years) rather than variables

I ran into similar “issues” in my previous analysis of the air accident data, so I will reuse (almost) the same cleaning code.

library(ggplot2)
library(dplyr)
library(tidyr)
library(countrycode)
data.source <- read.csv("C:/Users/marc/Desktop/Data/160828_age median/indicator_median age.csv", sep = ";", header= TRUE, stringsAsFactors = FALSE)
data.source <- as_tibble(data.source) # tbl_df() is deprecated in current dplyr
head(data.source[,1:5], 3) # first rows and columns: the years go in five-year steps
## # A tibble: 3 x 5
##    Median.age  X1950  X1955  X1960  X1965
##         <chr>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Afghanistan 18.597 18.462 18.311 18.193
## 2      AFRICA 19.173 18.862 18.434 18.020
## 3     Albania 20.640 20.202 19.677 18.955
# Not all locations are countries: there are also continents, regions, etc.

Data cleaning

# Remove the trailing empty rows (230 and 231) and the empty last column (23)
data.cleaned <- data.source[-(230:231),-23]

#Moving columns into rows
data.cleaned <- gather(data.cleaned, year, median.age, -Median.age)
head(data.cleaned,3)
## # A tibble: 3 x 3
##    Median.age  year median.age
##         <chr> <chr>      <dbl>
## 1 Afghanistan X1950     18.597
## 2      AFRICA X1950     19.173
## 3     Albania X1950     20.640
colnames(data.cleaned) <- c("region", "year", "median.age")

#remove the "X" in front of each year
data.cleaned$year <- gsub("X", "", data.cleaned$year)
data.cleaned$year <- as.numeric(data.cleaned$year)
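
The reshaping and the removal of the "X" prefix can also be done in one step. The following is only a sketch on a toy table built from the first rows shown above; the names wide.toy and long.toy are made up for the illustration, and it assumes a tidyr version recent enough to provide pivot_longer() with names_transform (1.1.0 or later).

```r
library(tidyr)

# Toy wide table: two rows and two year columns copied from the output above
wide.toy <- data.frame(
  Median.age = c("Afghanistan", "AFRICA"),
  X1950 = c(18.597, 19.173),
  X1955 = c(18.462, 18.862),
  stringsAsFactors = FALSE
)

# Wide to long: strip the "X" prefix and convert the year to integer in one call
long.toy <- pivot_longer(
  wide.toy,
  cols = -Median.age,
  names_to = "year",
  names_prefix = "X",
  names_transform = list(year = as.integer),
  values_to = "median.age"
)

# Rename the first column, as done for data.cleaned
names(long.toy)[1] <- "region"
print(long.toy)
```

gather() still works; pivot_longer() is simply the newer tidyr interface for the same reshaping.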

Data exploration

We will start by looking at the world trend and projection.

data.world <- filter(data.cleaned, region == "WORLD")

ggplot(data.world, aes(x=year, y=median.age, group=1))+
      geom_point(size=4, color= "#28ABE3")+
      geom_line(size= 1.2, color= "#28ABE3")+
      theme_bw()+
      ggtitle("World Median age evolution")+
      scale_x_continuous(breaks= seq(1950,2050,10))+
      scale_y_continuous(limits=c(10, 40), breaks= seq(0,40,5))

No surprise there: the projection shows the median age going up. Let’s check the continents now.

data.continent <- filter(data.cleaned, region %in% c("AFRICA", "ASIA","EUROPE","LATIN AMERICA AND THE CARIBBEAN","NORTHERN AMERICA","OCEANIA"))

# The palette used below is a mix of colours found on this website:
# http://www.color-hex.com/color-palette/700

my.palette <- c("#b94665", "#a7a6a6", "#2ecee5", "#f7b758", "#009E73", "#f37735")

ggplot(data.continent, aes(x=year, y=median.age, group=region, colour=region))+
      geom_point(size=4)+
      geom_line(size= 1.2)+
      theme_bw()+
      ggtitle("Median age evolution by continent")+
      scale_x_continuous(breaks= seq(1950,2050,10))+
      scale_y_continuous(limits=c(10, 50), breaks= seq(0,60,10))+
      scale_colour_manual(values = my.palette)+
      theme(legend.position="bottom")

Africa is clearly the youngest continent, while Europe is the oldest. Interestingly, the projection indicates that the median age in Europe will start to decrease from 2040. Latin America and the Caribbean also show a strong increase from 1970 onward.

A quick look at Europe’s median age trend

To go further, let’s have a look at European countries. The problem is that the information is mixed: the rows in the first column of the table are sometimes countries, sometimes regions and sometimes continents.

We can use the countrycode package to map continent and region information to each country.

country <- unique(data.cleaned[[1]])
continent <- countrycode(country, origin = "country.name", destination = "continent")
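# Assumption to check against your installed version: recent countrycode releases
# may map "region" to World Bank regions; "region23" or "un.regionsub.name"
# should return sub-regions such as "Southern Europe".
region <- countrycode(country, origin = "country.name", destination = "region")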

geo.table <- data.frame(country = country, continent = continent, region= region)
country.na <- is.na(geo.table$continent)

As we saw above, the dataset also includes continents (such as Asia) and regions (such as Central America). When we map each row, NA is returned for every element that is not a country, or that is not recognized as one. None of them represents a country, so we can remove them from the list.

geo.table <- na.omit(geo.table)

There are still a few mistakes that we can correct manually. For example, there is a row for “Less developed regions, excluding China”, which the countrycode function interpreted as “China”.

# Drop the rows wrongly matched to a country (positions 9 and 100 after na.omit)
geo.table <- geo.table[-c(9,100),]
colnames(data.cleaned)[1] <- "country"
data.cleaned <- merge(data.cleaned, geo.table, by= "country", all.x=TRUE)
Europe <- filter(data.cleaned, continent == "Europe")
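
Removing rows by position breaks as soon as the input file changes, so removing them by name is more robust. Here is a minimal sketch on a toy table; toy.geo and false.matches are hypothetical names, and the list only contains the one false match mentioned in the text above.

```r
# Toy version of geo.table after the countrycode mapping
toy.geo <- data.frame(
  country = c("China", "Less developed regions, excluding China", "Albania"),
  continent = c("Asia", "Asia", "Europe"),
  stringsAsFactors = FALSE
)

# Labels that countrycode matched to a country by mistake
# (assumption: extend this vector with any other false match you find)
false.matches <- c("Less developed regions, excluding China")

# Keep only the rows whose label is not a known false match
toy.geo <- toy.geo[!toy.geo$country %in% false.matches, ]
print(toy.geo)
```

The same filter applied to geo.table would replace the positional indices, provided every false match is listed.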

ggplot(Europe, aes(x=year, y=median.age, group=country, colour=region))+
      geom_line(size= 1.0, alpha=0.5)+ # constant transparency is set outside aes()
      theme_bw()+
      geom_smooth(method=lm, aes(group=region), se=FALSE, colour="black", size =2)+
      scale_x_continuous(breaks= seq(1950,2050,20))+
      scale_y_continuous(limits=c(10, 60), breaks= seq(0,60,10))+
      scale_colour_manual(values = my.palette)+
      theme(legend.position="none")+
      facet_wrap( ~ region)+
      ggtitle("Median age evolution by country in Europe with regression line by region")

Southern Europe shows a particularly strong increase. Let’s check which countries have the biggest increase.

# The third positional argument of spread() is `fill`, so the former -c(1,4,5)
# was not a column selection and only triggered a warning; it is dropped here.
Europe.south <- filter(Europe, year == 1950 | year == 2050,
                       region == "Southern Europe") %>%
      spread(year, median.age) %>%
      mutate(diff.1950.2050 = `2050`-`1950`) %>%
      arrange(desc(diff.1950.2050))
colnames(Europe.south) <- c("country", "continent", "region", "1950", "2050", "difference btw 1950 & 2050")
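
To make the computation concrete, here is a small sketch of the same long-to-wide step and difference on two countries, using values taken from the results table of this section. It uses pivot_wider(), the newer tidyr counterpart of spread(); toy.years and toy.diff are illustrative names only.

```r
library(tidyr)

# Two countries in long format, values copied from the results table
toy.years <- data.frame(
  country = c("Serbia", "Serbia",
              "Bosnia and Herzegovina", "Bosnia and Herzegovina"),
  year = c(1950, 2050, 1950, 2050),
  median.age = c(25.832, 44.731, 20.039, 52.196),
  stringsAsFactors = FALSE
)

# One column per year, then the difference between the two years
toy.diff <- pivot_wider(toy.years, names_from = year, values_from = median.age)
toy.diff$diff.1950.2050 <- toy.diff$`2050` - toy.diff$`1950`

# Largest increase first
toy.diff <- toy.diff[order(-toy.diff$diff.1950.2050), ]
print(toy.diff)
```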

Median age of Southern European countries in 1950 and 2050

knitr::kable(Europe.south[,-c(2,3)])
country                   1950    2050  difference btw 1950 & 2050
Bosnia and Herzegovina  20.039  52.196                      32.157
Malta                   23.654  50.523                      26.869
Macedonia, FYR          22.295  47.636                      25.341
Portugal                26.176  50.404                      24.228
Albania                 20.640  44.268                      23.628
Greece                  25.957  49.507                      23.550
Montenegro              21.343  43.858                      22.515
Italy                   28.620  50.539                      21.919
Slovenia                27.703  48.744                      21.041
Spain                   27.684  48.245                      20.561
Croatia                 27.898  48.171                      20.273
Serbia                  25.832  44.731                      18.899

Bosnia and Herzegovina shows the largest projected increase in median age, about 32 years between 1950 and 2050. This dataset alone is not enough to explain why; you can read more on this on the brookings.edu blog.