Summary

This short analysis is a self-training exercise. The dataset comes from Gapminder; the underlying data originate from the UN Population Division.

data source: https://www.gapminder.org/data/
data file: https://docs.google.com/spreadsheets/d/1aYvBICRlEsG0VGjJ8SPD2Ot5O2gBPdfeL0hJYQ1Na90/pub?gid=0#

The indicator represents the median age per country per year.

Dataset first overview

After loading the file and taking a quick look at the data, four small issues appear:

each column name (except the first one) starts with “X”
the last column, named “X”, is empty
the last row is empty
the columns are not variables: each year has its own column (wide format)
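
The “X” prefix and the column literally named “X” are most likely a side effect of read.csv rather than of the file itself: with its default check.names = TRUE, read.csv passes the header through make.names(). A small sketch, where the raw header fields are an assumption inferred from the column names shown in the output (they are not copied from the file):

```r
# Header fields as they presumably appear in the raw file (assumed):
# a label containing a space, two years, and an empty trailing field.
raw.names <- c("Median age", "1950", "1955", "")

# make.names() is what read.csv applies to the header when check.names = TRUE.
clean.names <- make.names(raw.names, unique = TRUE)
print(clean.names)
```

Reading the file with check.names = FALSE would keep the years unchanged, at the cost of non-syntactic column names that need backticks.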

I ran into similar issues in my previous analysis of the air accident data, so I will reuse (almost) the same cleaning code.

library(ggplot2)
library(dplyr)
library(tidyr)
library(countrycode)

data.source <- read.csv("C:/Users/marc/Desktop/Data/160828_age median/indicator_median age.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
data.source <- as_tibble(data.source)
head(data.source[, 1:5], 3) # first rows and columns: the years go in five-year steps

## # A tibble: 3 x 5
##    Median.age  X1950  X1955  X1960  X1965
##         <chr>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Afghanistan 18.597 18.462 18.311 18.193
## 2      AFRICA 19.173 18.862 18.434 18.020
## 3     Albania 20.640 20.202 19.677 18.955

# Not all locations are countries: some rows are continents or other aggregates.

Data cleaning

#Removing the trailing empty rows (230 and 231) and the empty last column (23)
data.cleaned <- data.source[-(230:231),-23]

#Moving columns into rows
data.cleaned <- gather(data.cleaned, year, median.age, -Median.age)
head(data.cleaned,3)

## # A tibble: 3 x 3
##    Median.age  year median.age
##         <chr> <chr>      <dbl>
## 1 Afghanistan X1950     18.597
## 2      AFRICA X1950     19.173
## 3     Albania X1950     20.640

colnames(data.cleaned) <- c("region", "year", "median.age")

#remove the leading "X" in front of each year
data.cleaned$year <- sub("^X", "", data.cleaned$year)
data.cleaned$year <- as.numeric(data.cleaned$year)
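
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), which can also strip the “X” prefix while reshaping. The following is only a sketch on a two-country extract of the table above (values copied from the printed output); toy.wide and toy.long are illustrative names, not objects from the analysis:

```r
library(tidyr)

# Two rows and two year columns copied from the head() output above.
toy.wide <- data.frame(
  Median.age = c("Afghanistan", "Albania"),
  X1950 = c(18.597, 20.640),
  X1955 = c(18.462, 20.202),
  stringsAsFactors = FALSE
)

# Move the year columns into rows and drop the "X" prefix in one step.
toy.long <- pivot_longer(
  toy.wide,
  cols = -Median.age,
  names_to = "year",
  names_prefix = "X",
  values_to = "median.age"
)

# Same renaming and type conversion as in the cleaning code.
names(toy.long)[1] <- "region"
toy.long$year <- as.numeric(toy.long$year)
print(toy.long)
```

Note that pivot_longer() orders the result by row of the input (all years of one location first), whereas gather() stacks column by column.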

Data exploration

We start with the world trend and its projection.

data.world <- filter(data.cleaned, region == "WORLD")

ggplot(data.world, aes(x=year, y=median.age, group=1))+
      geom_point(size=4, color= "#28ABE3")+
      geom_line(size= 1.2, color= "#28ABE3")+
      theme_bw()+
      ggtitle("World Median age evolution")+
      scale_x_continuous(breaks= seq(1950,2050,10))+
      scale_y_continuous(limits=c(10, 40), breaks= seq(0,40,5))

No surprise there: the projection shows the median age going up. Let’s look at the continents now.

data.continent <- filter(data.cleaned, region %in% c("AFRICA", "ASIA","EUROPE","LATIN AMERICA AND THE CARIBBEAN","NORTHERN AMERICA","OCEANIA"))

#The palette used below is a mix of colours found on this website:
#http://www.color-hex.com/color-palette/700

my.palette <- c("#b94665", "#a7a6a6", "#2ecee5", "#f7b758", "#009E73", "#f37735")

ggplot(data.continent, aes(x=year, y=median.age, group=region, colour=region))+
      geom_point(size=4)+
      geom_line(size= 1.2)+
      theme_bw()+
      ggtitle("Median age evolution by continent")+
      scale_x_continuous(breaks= seq(1950,2050,10))+
      scale_y_continuous(limits=c(10, 50), breaks= seq(0,60,10))+
      scale_colour_manual(values = my.palette)+
      theme(legend.position="bottom")

Africa is clearly the youngest continent, while Europe is the oldest. Interestingly, the projection indicates that Europe’s median age will start to decrease from 2040. Latin America and the Caribbean also show a marked increase from 1970 onwards.

A quick look at Europe’s median age tendency

To go further, let’s have a look at European countries. The problem is that the first column of the table mixes several levels: a row can be a country, a region or a continent.

We can use the package countrycode to map continent and region information for each country.

country <- unique(data.cleaned[[1]])
continent <- countrycode(country, origin = "country.name", destination = "continent")
region <- countrycode(country, origin = "country.name", destination = "region")

geo.table <- data.frame(country = country, continent = continent, region= region)
country.na <- is.na(geo.table$continent)

As we saw above, the dataset also includes continents (such as Asia) and regions (such as Central America). When we map each row, an NA appears for every entry that is not a country, or that is not recognized as one. None of the NA entries represents a country, so we can remove them from the list.

geo.table <- na.omit(geo.table)
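
To illustrate the behaviour described above, here is a minimal sketch on three labels taken from the dataset, one country and two aggregates. The aggregates are expected to come back as NA, which is what na.omit then removes; loc.labels and mapped are illustrative names:

```r
library(countrycode)

# One country and two aggregate rows from the dataset.
loc.labels <- c("Albania", "AFRICA", "WORLD")

# warn = FALSE silences the warning that lists the unmatched values.
mapped <- countrycode(loc.labels, origin = "country.name",
                      destination = "continent", warn = FALSE)

# Rows with an NA continent are the ones na.omit() would drop.
print(data.frame(label = loc.labels, continent = mapped,
                 is.country = !is.na(mapped)))
```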

There are still a few mismatches that we can correct manually. For example, the row “Less developed regions, excluding China” was interpreted as “China” by the countrycode function.

geo.table <- geo.table[-c(9,100),]
colnames(data.cleaned)[1] <- "country"
data.cleaned <- merge(data.cleaned, geo.table, by= "country", all.x=TRUE)
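
Removing rows by position (9 and 100) depends on the row order of geo.table. As a hedged alternative, false matches can be spotted by asking countrycode which standardised name each label was matched to and comparing it with the label itself. This is only a sketch: candidates and check are hypothetical names, and a differing name is a hint to look closer, not proof of an error (legitimate spelling variants are flagged too):

```r
library(countrycode)

# Two real countries and one aggregate label quoted in the text above.
candidates <- c("Albania", "China", "Less developed regions, excluding China")

# Mapping country.name to country.name returns the standardised name
# each label was matched to (or NA when nothing matched).
matched <- countrycode(candidates, origin = "country.name",
                       destination = "country.name", warn = FALSE)

check <- data.frame(label = candidates, matched.as = matched,
                    stringsAsFactors = FALSE)

# Flag labels whose text differs from the matched name for a manual look.
check$suspicious <- is.na(check$matched.as) |
  tolower(check$label) != tolower(check$matched.as)
print(check)
```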

Europe <- filter(data.cleaned, continent == "Europe")

ggplot(Europe, aes(x=year, y=median.age, group=country, colour=region))+
      geom_line(size= 1.0, alpha=0.5)+
      theme_bw()+
      geom_smooth(method=lm, aes(group=region), se=FALSE, colour="black", size =2)+
      scale_x_continuous(breaks= seq(1950,2050,20))+
      scale_y_continuous(limits=c(10, 60), breaks= seq(0,60,10))+
      scale_colour_manual(values = my.palette)+
      theme(legend.position="none")+
      facet_wrap( ~ region)+
      ggtitle("Median age evolution by country in Europe with regression line by region")
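
geom_smooth(method=lm) with group=region fits one ordinary least-squares line through all country points of a region. To read such a slope as a number, lm() can be called directly. The sketch below pools only Italy and Spain, with their 1950 and 2050 values taken from the table at the end of this post, so the result is purely illustrative and not the slope drawn in the plot; toy and fit are hypothetical names:

```r
# 1950 and 2050 median ages for two Southern European countries
# (values copied from the table at the end of the post).
toy <- data.frame(
  country = c("Italy", "Italy", "Spain", "Spain"),
  year = c(1950, 2050, 1950, 2050),
  median.age = c(28.620, 50.539, 27.684, 48.245)
)

# One pooled regression line, as geom_smooth does per region.
fit <- lm(median.age ~ year, data = toy)

# Slope = average change in median age per calendar year.
slope <- coef(fit)[["year"]]
print(round(slope, 4))
```

Multiplying the slope by 100 gives the average increase over the whole 1950 to 2050 period for the pooled countries.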

Southern Europe shows a particularly strong increase. Let’s check which of its countries increase the most.

Europe.south <- filter(Europe, year == 1950 | year == 2050,
      region == "Southern Europe") %>%
      spread(year, median.age) %>%
      mutate(diff.1950.2050 = `2050` - `1950`) %>%
      arrange(desc(diff.1950.2050))

colnames(Europe.south) <- c("country", "continent", "region", "1950", "2050", "difference btw 1950 & 2050")

Median age of Southern European countries in 1950 and 2050

knitr::kable(Europe.south[,-c(2,3)])

country	1950	2050	difference btw 1950 & 2050
Bosnia and Herzegovina	20.039	52.196	32.157
Malta	23.654	50.523	26.869
Macedonia, FYR	22.295	47.636	25.341
Portugal	26.176	50.404	24.228
Albania	20.640	44.268	23.628
Greece	25.957	49.507	23.550
Montenegro	21.343	43.858	22.515
Italy	28.620	50.539	21.919
Slovenia	27.703	48.744	21.041
Spain	27.684	48.245	20.561
Croatia	27.898	48.171	20.273
Serbia	25.832	44.731	18.899
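
The difference column can be reproduced from the two end-point columns. Here is a minimal sketch on the first two rows of the table, assuming tidyr 1.0.0 or later for pivot_wider() (the analysis itself uses spread()); toy.long and toy.wide are illustrative names:

```r
library(dplyr)
library(tidyr)

# Long-format extract: two countries, two years (values from the table above).
toy.long <- data.frame(
  country = rep(c("Bosnia and Herzegovina", "Malta"), each = 2),
  year = rep(c(1950, 2050), times = 2),
  median.age = c(20.039, 52.196, 23.654, 50.523),
  stringsAsFactors = FALSE
)

# One column per year, then the 2050 minus 1950 difference, largest first.
toy.wide <- toy.long %>%
  pivot_wider(names_from = year, values_from = median.age) %>%
  mutate(diff.1950.2050 = `2050` - `1950`) %>%
  arrange(desc(diff.1950.2050))
print(toy.wide)
```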

Bosnia and Herzegovina stands out with the largest increase, about 32 years between 1950 and 2050. This dataset alone cannot explain why; however, you can read more on this on the Brookings blog (brookings.edu).

Median age tendency per country

Ndee

2016-09-04
