Project 2C - Migration Data

We are looking at data collected by the United Nations found here.

The document consists of multiple tabs of data. We will focus on Table 1 which contains International migrant stock at mid-year by sex and by major area, region, country or area, 1990-2019. We will look at the non-aggregated data for each country for both males and females to see what trends we can derive from this dataset.

Our task is to select the information of interest from the document and transform it in order to facilitate analysis.

Data Import

We start by reading in the .xlsx file (manually downloaded) from the github link above and placing it into the working directory.

Data Evaluation

Here is a quick look at the data. There are rows for every country as well as for aggregations by region. We can only find data by year and gender for the range 1990-2019.

Raw Data

Raw Data

One thing to note is that the excel version of the file uses merged columns has headers. When read into R, these merged headers are split into a single cell so we will have to locate that cell those cells contain header of importance.

Data Transformation

We start by selecting the rows and columns of interest. We will ignore the aggregated data and select the country rows and the and the migrant stock data for the individual sexes

Rename columns.

Split the table into two to deal with the repeated column names, gather the years into the rows and create a Sex column. After that, we bind by rows the split tables back together and drop NA values.

Let’s take a look at 15 random oberservations from this tidy data table.

Long Format
Country Year MigrantCount Sex
Anguilla 1995 1733 Female
Bahrain 1990 122730 Male
Philippines 2015 109783 Male
Tajikistan 2019 118207 Male
Kiribati 2005 1175 Female
United States Virgin Islands 2005 29921 Female
Samoa 2005 2940 Male
Cambodia 2019 36270 Female
Ireland 1995 118808 Female
Bosnia and Herzegovina 1990 29462 Female
Mayotte 2015 34235 Male
China, Hong Kong SAR 2005 1185121 Male
Puerto Rico 1995 177328 Female
Burkina Faso 2010 320939 Male
United Kingdom 2010 3665208 Female

Data Analysis & Visualization

We calculate the total number of migrants per year to use in our analysis. We also divide the total column by 1 million so that our data looks more clear.

We then proceed to plot our data.

The plot below shows the same information as above but presented differently. What we see looking at this plot is that there seems to be an increase in the gap between male and female migrants.

To look into the this gap in more details, we study the proportions by gender.

The plot above clearly shows what was outlined above. We see that starting in 1995, the proportion of migrant has become increasingly male although we remain near parity in 2019 with 48% of women migrants and 52% male migrants.

Conclusion

While working with this dataset, we encourated a few forms of “un-tidyness”:
- A dataset containing rows that are not useful for analysis
- Tables with year values in column headers
- Merged column headings on top of other headings
- Aggregated data in the middle of the dataset

The key to working with this dataset was to identify the data of interested and in particular the rows that contained aggregated data such as regions. Thankfully, we could use other columns in the dataset to our advantage to filter the appropriate rows. We use other tools from the tidyverse to reformat the dataset into long format and tidy data to help clraify our analysis.

To take this study further, we could look at categorizing the migration movements by region and doing our own aggregations. We could perhaps see if some particular years had more pronounced migrations due to some consequential events in the particular countries. This information could serve as a guide to spotting the macro trends behind population movements.