Summary
I did this short analysis as a self-training exercise. The dataset comes from Gapminder, which sourced the data from the UN Population Division.
data source: https://www.gapminder.org/data/
data file: https://docs.google.com/spreadsheets/d/1aYvBICRlEsG0VGjJ8SPD2Ot5O2gBPdfeL0hJYQ1Na90/pub?gid=0#
The indicator represents the median age per country per year.
First look at the dataset
After loading the file and taking a quick look at the data, four small issues appear:
each column name (except the first one) starts with “X”
the last column, named “X”, is empty
the last row is empty
the year columns are values of a variable rather than variables (the table is in wide format)
I ran into similar issues in my previous analysis of the air accident data, so I will reuse (almost) the same cleaning code.
library(ggplot2)
library(dplyr)
library(tidyr)
library(countrycode)
data.source <- read.csv("C:/Users/marc/Desktop/Data/160828_age median/indicator_median age.csv", sep = ";", header= TRUE, stringsAsFactors = FALSE)
data.source <- as_tibble(data.source) # tbl_df() is deprecated in recent dplyr versions
head(data.source[,1:5], 3) # first rows and columns: the years go in five-year steps
## # A tibble: 3 x 5
## Median.age X1950 X1955 X1960 X1965
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 18.597 18.462 18.311 18.193
## 2 AFRICA 19.173 18.862 18.434 18.020
## 3 Albania 20.640 20.202 19.677 18.955
# Not all locations are countries: the table also contains continents and other aggregates.
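The “X” prefix comes from `read.csv()` itself: by default (`check.names = TRUE`) it turns headers into syntactically valid R names. A minimal sketch of that behaviour follows; the inline CSV text and the original header “Median age” are assumptions made for illustration, not the actual file.

```r
# Hypothetical three-line extract; the real file has many more rows and columns.
csv.text <- "Median age;1950;1955\nAfghanistan;18.597;18.462\nAlbania;20.640;20.202"

# Default behaviour: headers are converted to syntactic names.
default.names <- names(read.csv(text = csv.text, sep = ";"))

# With check.names = FALSE the headers are kept exactly as written.
raw.names <- names(read.csv(text = csv.text, sep = ";", check.names = FALSE))

print(default.names)
print(raw.names)
```

Keeping the raw names avoids the later `gsub()` step, at the cost of having to quote the column names with backticks.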
Data cleaning
#Removing last empty row and column
data.cleaned <- data.source[-(230:231),-23]
#Moving columns into rows
data.cleaned <- gather(data.cleaned, year, median.age, -Median.age)
head(data.cleaned,3)
## # A tibble: 3 x 3
## Median.age year median.age
## <chr> <chr> <dbl>
## 1 Afghanistan X1950 18.597
## 2 AFRICA X1950 19.173
## 3 Albania X1950 20.640
colnames(data.cleaned) <- c("region", "year", "median.age")
#remove the "X" in front of each year
data.cleaned$year <- gsub("X", "", data.cleaned$year)
data.cleaned$year <- as.numeric(data.cleaned$year)
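The cleaning above relies on hard-coded positions (rows 230:231, column 23). As a hedged alternative, the empty rows and columns could be detected from their content. The sketch below uses a small toy table in base R; the helper `is.empty` and the toy object names are hypothetical, and only the first two data rows reuse values from the output shown earlier.

```r
# Toy wide table mimicking the layout of the source file.
wide <- data.frame(
  Median.age = c("Afghanistan", "AFRICA", ""),
  X1950 = c(18.597, 19.173, NA),
  X1955 = c(18.462, 18.862, NA),
  X = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# A cell counts as empty when it is NA or an empty string.
is.empty <- function(x) is.na(x) | x == ""

# Drop the columns in which every cell is empty.
empty.cols <- vapply(wide, function(col) all(is.empty(col)), logical(1))
trimmed <- wide[, !empty.cols, drop = FALSE]

# Drop the rows in which every remaining cell is empty.
empty.rows <- Reduce(`&`, lapply(trimmed, is.empty))
trimmed <- trimmed[!empty.rows, , drop = FALSE]

# Anchored pattern: only a leading "X" is removed from the year labels.
years <- as.numeric(sub("^X", "", names(trimmed)[-1]))
```

This keeps working if the file gains or loses rows, which fixed indices would not.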
Data exploration
We start by looking at the world trend and its projection.
data.world <- filter(data.cleaned, region == "WORLD")
ggplot(data.world, aes(x=year, y=median.age, group=1))+
geom_point(size=4, color= "#28ABE3")+
geom_line(size= 1.2, color= "#28ABE3")+
theme_bw()+
ggtitle("World Median age evolution")+
scale_x_continuous(breaks= seq(1950,2050,10))+
scale_y_continuous(limits=c(10, 40), breaks= seq(0,40,5))
No surprise there: the projection shows the median age going up. Let’s look at the continents now.
data.continent <- filter(data.cleaned, region %in% c("AFRICA", "ASIA","EUROPE","LATIN AMERICA AND THE CARIBBEAN","NORTHERN AMERICA","OCEANIA"))
# The palette below is a mix of colours found on this website:
#http://www.color-hex.com/color-palette/700
my.palette <- c("#b94665", "#a7a6a6", "#2ecee5", "#f7b758", "#009E73", "#f37735")
ggplot(data.continent, aes(x=year, y=median.age, group=region, colour=region))+
geom_point(size=4)+
geom_line(size= 1.2)+
theme_bw()+
ggtitle("Median age evolution by continent")+
scale_x_continuous(breaks= seq(1950,2050,10))+
scale_y_continuous(limits=c(10, 50), breaks= seq(0,60,10))+
scale_colour_manual(values = my.palette)+
theme(legend.position="bottom")
Africa is clearly the youngest continent, while Europe is the oldest. Interestingly, the projection indicates that the median age in Europe will start to decrease from 2040. There is also a strong increase from 1970 onwards for Latin America and the Caribbean.
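A turning point like the one read off the chart for Europe can also be checked numerically by looking for the year in which each region peaks. The sketch below is a base R illustration on made-up numbers; `toy` and `peak.year` are hypothetical names, and with the real data `data.continent` would take the place of `toy`.

```r
# Hypothetical values for illustration only; they are not taken from the dataset.
toy <- data.frame(
  region = rep(c("EUROPE", "AFRICA"), each = 4),
  year = rep(c(2030, 2035, 2040, 2045), times = 2),
  median.age = c(44.0, 45.0, 45.5, 45.2, 21.0, 22.0, 23.0, 24.0)
)

# For each region, the year in which the median age is highest.
peak.year <- sapply(
  split(toy, toy$region),
  function(d) d$year[which.max(d$median.age)]
)

print(peak.year)
```

A region whose peak falls before the last available year is projected to decline afterwards.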
A quick look at Europe’s median age trend
To go further, let’s look at European countries. The problem is that the first column mixes different kinds of entries: its rows are sometimes countries, sometimes regions or continents.
We can use the countrycode package to map each country to its continent and region.
country <- unique(data.cleaned[[1]])
continent <- countrycode(country, origin = "country.name", destination = "continent")
region <- countrycode(country, origin = "country.name", destination = "region")
geo.table <- data.frame(country = country, continent = continent, region= region)
country.na <- is.na(geo.table$continent)
As we saw above, the dataset also includes continents (such as Asia) and regions (such as Central America). When we map each row, NA appears for every entry that is not a country or is not recognized as one. None of these entries is a country, so we can remove them from the list.
geo.table <- na.omit(geo.table)
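Before dropping the NA rows, it can be worth listing which names failed to match, to confirm that no real country is lost. The sketch below shows the idea in base R on a toy lookup table; `geo.toy` is hypothetical and its continent and region labels are illustrative rather than actual countrycode output.

```r
# Toy lookup table with two unmatched entries.
geo.toy <- data.frame(
  country = c("Afghanistan", "AFRICA", "Albania", "Central America"),
  continent = c("Asia", NA, "Europe", NA),
  region = c("Southern Asia", NA, "Southern Europe", NA),
  stringsAsFactors = FALSE
)

# Names that received no continent: these are the rows na.omit() will remove.
unmatched <- geo.toy$country[is.na(geo.toy$continent)]
print(unmatched)

# na.omit() drops every row that contains at least one NA.
geo.kept <- na.omit(geo.toy)
```

On the real table, the same check could reuse the `country.na` vector computed above.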
Moreover, a few mistakes remain that we can correct manually. For example, the row “Less developed regions, excluding China” was interpreted by the countrycode function as “China”.
geo.table <- geo.table[-c(9,100),]
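Removing rows by position (9 and 100) depends on the exact row order of the table. A hedged alternative is to remove the false matches by name, as in the base R sketch below. `geo.toy` and `false.matches` are hypothetical, and only the one label quoted in the text is listed, since the second removed row is not named.

```r
# Toy lookup table containing one false match.
geo.toy <- data.frame(
  country = c("Albania", "China", "Less developed regions, excluding China"),
  continent = c("Europe", "Asia", "Asia"),
  stringsAsFactors = FALSE
)

# Labels wrongly recognised as countries; extend this vector as needed.
false.matches <- c("Less developed regions, excluding China")

# Keep every row whose name is not in the list of false matches.
geo.fixed <- geo.toy[!geo.toy$country %in% false.matches, ]
```

The real “China” row is kept, because the comparison is on the full label rather than on a pattern.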
colnames(data.cleaned)[1] <- "country"
data.cleaned <- merge(data.cleaned, geo.table, by= "country", all.x=TRUE)
Europe <- filter(data.cleaned, continent == "Europe")
ggplot(Europe, aes(x=year, y=median.age, group=country, colour=region))+
geom_line(size= 1.0, alpha=0.5)+ # alpha is a fixed value, so it is set outside aes()
theme_bw()+
geom_smooth(method=lm, aes(group=region), se=FALSE, colour="black", size =2)+
scale_x_continuous(breaks= seq(1950,2050,20))+
scale_y_continuous(limits=c(10, 60), breaks= seq(0,60,10))+
scale_colour_manual(values = my.palette)+
theme(legend.position="none")+
facet_wrap( ~ region)+
ggtitle("Median age evolution by country in Europe with regression line by region")
Southern Europe shows a particularly strong increase. Let’s check which countries increase the most.
Europe.south <- filter(Europe, year == 1950 | year == 2050,
region == "Southern Europe") %>%
# spread()'s third argument is `fill`; the -c(1,4,5) originally passed there only triggered the
# warning "the condition has length > 1 and only the first element will be used", so it is dropped
spread(year, median.age) %>%
mutate(diff.1950.2050 = `2050`-`1950`) %>%
arrange(desc(diff.1950.2050))
colnames(Europe.south) <- c("country", "continent", "region", "1950", "2050", "difference btw 1950 & 2050")
Median age for Southern European countries in 1950 and 2050
knitr::kable(Europe.south[,-c(2,3)])
| country | 1950 | 2050 | difference btw 1950 & 2050 |
|---|---|---|---|
| Bosnia and Herzegovina | 20.039 | 52.196 | 32.157 |
| Malta | 23.654 | 50.523 | 26.869 |
| Macedonia, FYR | 22.295 | 47.636 | 25.341 |
| Portugal | 26.176 | 50.404 | 24.228 |
| Albania | 20.640 | 44.268 | 23.628 |
| Greece | 25.957 | 49.507 | 23.550 |
| Montenegro | 21.343 | 43.858 | 22.515 |
| Italy | 28.620 | 50.539 | 21.919 |
| Slovenia | 27.703 | 48.744 | 21.041 |
| Spain | 27.684 | 48.245 | 20.561 |
| Croatia | 27.898 | 48.171 | 20.273 |
| Serbia | 25.832 | 44.731 | 18.899 |
Bosnia and Herzegovina stands out with a median age increase of about 32 years. This dataset alone is not enough to explain why; however, you can read more about it on the Brookings blog (brookings.edu).
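The 1950 versus 2050 comparison can also be written with `pivot_wider()`, which recent tidyr versions recommend over `spread()`. The sketch below is one possible formulation, not the code used for the table above. It reuses two countries' values from that table; the names `long`, `wide` and `difference` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Two rows per country, with values copied from the table above.
long <- data.frame(
  country = c("Malta", "Malta", "Bosnia and Herzegovina", "Bosnia and Herzegovina"),
  year = c(1950, 2050, 1950, 2050),
  median.age = c(23.654, 50.523, 20.039, 52.196)
)

# One column per year, then the change between the two years, largest first.
wide <- long %>%
  pivot_wider(names_from = year, values_from = median.age) %>%
  mutate(difference = `2050` - `1950`) %>%
  arrange(desc(difference))

print(wide)
```

Naming the `names_from` and `values_from` arguments explicitly avoids the kind of positional-argument mix-up that `spread()` allows.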