Project 2C - Migration Data

We are looking at data collected by the United Nations found here.

The document consists of multiple tabs of data. We will focus on Table 1 which contains International migrant stock at mid-year by sex and by major area, region, country or area, 1990-2019. We will look at the non-aggregated data for each country for both males and females to see what trends we can derive from this dataset.

Our task is to select the information of interest from the document and transform it in order to facilitate analysis.

Libraries

Load required libraries

library(tidyverse) #readr, dplyr, tidyr, stringr, tibble, ggplot2
library(knitr)
library(scales)
library(kableExtra)
library(readxl)

A helper function for displaying tables

showtable <- function(data, title) {
  kable(data, caption = title) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
}

Data Import

We start by reading in the .xlsx file (manually downloaded) from the github link above and placing it into the working directory.

untidy <- read_xlsx("UN_MigrantStockTotal_2019.xlsx", sheet = "Table 1")

Data Evaluation

Here is a quick look at the data. There are rows for every country as well as for aggregations by region. We can only find data by year and gender for the range 1990-2019.

Raw Data

One thing to note is that the excel version of the file uses merged columns has headers. When read into R, these merged headers are split into a single cell so we will have to locate that cell those cells contain header of importance.

Data Transformation

We start by selecting the rows and columns of interest. We will ignore the aggregated data and select the country rows and the and the migrant stock data for the individual sexes

clean <- untidy %>% select(2,`United Nations`,13:26) %>% slice(33:294)
clean <- clean %>% filter(!is.na(`United Nations`)) %>% select(-`United Nations`)
years <- untidy %>% select(2, 13:26) %>% slice(11)

Rename columns.

names(clean) <- as.character(years)
names(clean)[1] <- "Country"

Split the table into two to deal with the repeated column names, gather the years into the rows and create a Sex column. After that, we bind by rows the split tables back together and drop NA values.

cleaner_m <- clean[1:8]
cleaner_f <- clean[c(1,9:15)]
tidy_m <- cleaner_m %>% gather("1990":"2019", key = Year, value = MigrantCount) %>%  mutate(Sex = "Male")
tidy_f <- cleaner_f %>% gather("1990":"2019", key = Year, value = MigrantCount) %>%  mutate(Sex = "Female")
tidy <- rbind(tidy_m, tidy_f)
tidy["MigrantCount"] <- apply(tidy["MigrantCount"], 2, function(x) as.numeric(x))
tidy <- drop_na(tidy)

Let’s take a look at 15 random oberservations from this tidy data table.

showtable(sample_n(tidy, 15),"Long Format")

Long Format
Country	Year	MigrantCount	Sex
Anguilla	1995	1733	Female
Bahrain	1990	122730	Male
Philippines	2015	109783	Male
Tajikistan	2019	118207	Male
Kiribati	2005	1175	Female
United States Virgin Islands	2005	29921	Female
Samoa	2005	2940	Male
Cambodia	2019	36270	Female
Ireland	1995	118808	Female
Bosnia and Herzegovina	1990	29462	Female
Mayotte	2015	34235	Male
China, Hong Kong SAR	2005	1185121	Male
Puerto Rico	1995	177328	Female
Burkina Faso	2010	320939	Male
United Kingdom	2010	3665208	Female

Data Analysis & Visualization

We calculate the total number of migrants per year to use in our analysis. We also divide the total column by 1 million so that our data looks more clear.

data <- tidy %>% 
  group_by(Year, Sex)  %>% 
  summarize(Total=sum(MigrantCount)) %>% 
  mutate("Total_m" = round(Total/1000000,1))

We then proceed to plot our data.

ggplot(data, aes(x=Year, y=Total_m, fill=Sex)) +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_text(aes(label=Total_m), vjust=1.6, color="white", position = position_dodge(0.9), size=2) +
  scale_fill_brewer(palette="Paired") +
  theme_minimal() +
  labs(title="Migrant Population by Gender", x = "Year", y = "Total (in millions)")

The plot below shows the same information as above but presented differently. What we see looking at this plot is that there seems to be an increase in the gap between male and female migrants.

ggplot(data, aes(x=Year, y=Total_m, group=Sex)) +
  geom_line(linetype="dashed", aes(color=Sex)) +
  geom_point(aes(color=Sex)) +
  geom_text(aes(label=Total_m), vjust=-1, size=2) +
  theme(legend.position="top") +
  ggtitle("Migrant Proportions by Year and Sex") +
  ylab("Total Migrants (in millions)")

To look into the this gap in more details, we study the proportions by gender.

proportions_by_year <- data %>% 
  select(-Total_m) %>%
  spread(Sex, Total) %>% 
  mutate(Total = Female + Male, Female = Female/Total, Male = Male/Total)  %>%
  select(Year, Female, Male)

proportions_by_year<- proportions_by_year %>% 
  gather("Female":"Male", key = Sex, value = Proportion)

ggplot(proportions_by_year, aes(x=Year, y=Proportion, group=Sex)) +
  geom_line(linetype="dashed", aes(color=Sex)) +
  geom_point(aes(color=Sex)) +
  theme(legend.position="top") +
  scale_y_continuous(labels = percent) +
  ggtitle("Migrant Proportions by Year and Sex")

The plot above clearly shows what was outlined above. We see that starting in 1995, the proportion of migrant has become increasingly male although we remain near parity in 2019 with 48% of women migrants and 52% male migrants.

Conclusion

While working with this dataset, we encourated a few forms of “un-tidyness”:
- A dataset containing rows that are not useful for analysis
- Tables with year values in column headers
- Merged column headings on top of other headings
- Aggregated data in the middle of the dataset

The key to working with this dataset was to identify the data of interested and in particular the rows that contained aggregated data such as regions. Thankfully, we could use other columns in the dataset to our advantage to filter the appropriate rows. We use other tools from the tidyverse to reformat the dataset into long format and tidy data to help clraify our analysis.

To take this study further, we could look at categorizing the migration movements by region and doing our own aggregations. We could perhaps see if some particular years had more pronounced migrations due to some consequential events in the particular countries. This information could serve as a guide to spotting the macro trends behind population movements.

data607-illien-project2c

Mael Illien

9/30/2019