We are looking at data collected by the United Nations found here.
The document consists of multiple tabs of data. We will focus on Table 1 which contains International migrant stock at mid-year by sex and by major area, region, country or area, 1990-2019. We will look at the non-aggregated data for each country for both males and females to see what trends we can derive from this dataset.
Our task is to select the information of interest from the document and transform it in order to facilitate analysis.
Load required libraries
library(tidyverse) #readr, dplyr, tidyr, stringr, tibble, ggplot2
library(knitr)
library(scales)
library(kableExtra)
library(readxl)A helper function for displaying tables
We start by reading in the .xlsx file (manually downloaded) from the github link above and placing it into the working directory.
Here is a quick look at the data. There are rows for every country as well as for aggregations by region. We can only find data by year and gender for the range 1990-2019.
Raw Data
One thing to note is that the excel version of the file uses merged columns has headers. When read into R, these merged headers are split into a single cell so we will have to locate that cell those cells contain header of importance.
We start by selecting the rows and columns of interest. We will ignore the aggregated data and select the country rows and the and the migrant stock data for the individual sexes
clean <- untidy %>% select(2,`United Nations`,13:26) %>% slice(33:294)
clean <- clean %>% filter(!is.na(`United Nations`)) %>% select(-`United Nations`)
years <- untidy %>% select(2, 13:26) %>% slice(11)Rename columns.
Split the table into two to deal with the repeated column names, gather the years into the rows and create a Sex column. After that, we bind by rows the split tables back together and drop NA values.
cleaner_m <- clean[1:8]
cleaner_f <- clean[c(1,9:15)]
tidy_m <- cleaner_m %>% gather("1990":"2019", key = Year, value = MigrantCount) %>% mutate(Sex = "Male")
tidy_f <- cleaner_f %>% gather("1990":"2019", key = Year, value = MigrantCount) %>% mutate(Sex = "Female")
tidy <- rbind(tidy_m, tidy_f)
tidy["MigrantCount"] <- apply(tidy["MigrantCount"], 2, function(x) as.numeric(x))
tidy <- drop_na(tidy)Let’s take a look at 15 random oberservations from this tidy data table.
| Country | Year | MigrantCount | Sex |
|---|---|---|---|
| Anguilla | 1995 | 1733 | Female |
| Bahrain | 1990 | 122730 | Male |
| Philippines | 2015 | 109783 | Male |
| Tajikistan | 2019 | 118207 | Male |
| Kiribati | 2005 | 1175 | Female |
| United States Virgin Islands | 2005 | 29921 | Female |
| Samoa | 2005 | 2940 | Male |
| Cambodia | 2019 | 36270 | Female |
| Ireland | 1995 | 118808 | Female |
| Bosnia and Herzegovina | 1990 | 29462 | Female |
| Mayotte | 2015 | 34235 | Male |
| China, Hong Kong SAR | 2005 | 1185121 | Male |
| Puerto Rico | 1995 | 177328 | Female |
| Burkina Faso | 2010 | 320939 | Male |
| United Kingdom | 2010 | 3665208 | Female |
We calculate the total number of migrants per year to use in our analysis. We also divide the total column by 1 million so that our data looks more clear.
data <- tidy %>%
group_by(Year, Sex) %>%
summarize(Total=sum(MigrantCount)) %>%
mutate("Total_m" = round(Total/1000000,1))We then proceed to plot our data.
ggplot(data, aes(x=Year, y=Total_m, fill=Sex)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=Total_m), vjust=1.6, color="white", position = position_dodge(0.9), size=2) +
scale_fill_brewer(palette="Paired") +
theme_minimal() +
labs(title="Migrant Population by Gender", x = "Year", y = "Total (in millions)")The plot below shows the same information as above but presented differently. What we see looking at this plot is that there seems to be an increase in the gap between male and female migrants.
ggplot(data, aes(x=Year, y=Total_m, group=Sex)) +
geom_line(linetype="dashed", aes(color=Sex)) +
geom_point(aes(color=Sex)) +
geom_text(aes(label=Total_m), vjust=-1, size=2) +
theme(legend.position="top") +
ggtitle("Migrant Proportions by Year and Sex") +
ylab("Total Migrants (in millions)")To look into the this gap in more details, we study the proportions by gender.
proportions_by_year <- data %>%
select(-Total_m) %>%
spread(Sex, Total) %>%
mutate(Total = Female + Male, Female = Female/Total, Male = Male/Total) %>%
select(Year, Female, Male)
proportions_by_year<- proportions_by_year %>%
gather("Female":"Male", key = Sex, value = Proportion)ggplot(proportions_by_year, aes(x=Year, y=Proportion, group=Sex)) +
geom_line(linetype="dashed", aes(color=Sex)) +
geom_point(aes(color=Sex)) +
theme(legend.position="top") +
scale_y_continuous(labels = percent) +
ggtitle("Migrant Proportions by Year and Sex")The plot above clearly shows what was outlined above. We see that starting in 1995, the proportion of migrant has become increasingly male although we remain near parity in 2019 with 48% of women migrants and 52% male migrants.
While working with this dataset, we encourated a few forms of “un-tidyness”:
- A dataset containing rows that are not useful for analysis
- Tables with year values in column headers
- Merged column headings on top of other headings
- Aggregated data in the middle of the dataset
The key to working with this dataset was to identify the data of interested and in particular the rows that contained aggregated data such as regions. Thankfully, we could use other columns in the dataset to our advantage to filter the appropriate rows. We use other tools from the tidyverse to reformat the dataset into long format and tidy data to help clraify our analysis.
To take this study further, we could look at categorizing the migration movements by region and doing our own aggregations. We could perhaps see if some particular years had more pronounced migrations due to some consequential events in the particular countries. This information could serve as a guide to spotting the macro trends behind population movements.