Abstract

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:

  • Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: +Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
  • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
  • Perform the analysis requested in the discussion item.
  • Your code should be in an R Markdown file, posted to rpubs.com, and should include
    • narrative descriptions of your data cleanup work, analysis, and conclusions.
    • Please include in your homework submission, for each of the three chosen datasets:
  • The URL to the .Rmd file in your GitHub repository, and
  • The URL for your rpubs.com web page.

Introduction

The United Nations, Department of Economic and Social Affairs published dataset presents estimates of international migrant by age, sex and origin. Estimates are presented for 1990, 1995, 2000, 2005, 2010, 2015 and 2019 and are available for all countries and areas of the world. The estimates are based on official statistics on the foreign-born or the foreign population.

Questions

I decided to get more insight about the data provided by the United Nation for the international migrant stock on the last decade. Some of the questions I want to invistigate is:

  • Which countries are the best destination for migrants.
  • What are the top migrants country of origin?
  • What is the age categories for the migrants? adult, children, or seniors
  • How many refugees top cdestination countries can accept per time interval?

Libraries used

  • library(readr)
  • library(dplyr)
  • library(janitor)
  • library(tidyr)
  • library(ggmap)
  • library(ggplot2)
  • library(RCurl)

Dataset_1: International migrant stock By destination and origin

I will start by investigating on the total number of migrants by destination and origin country. First, we read the dataset into R.

Read the data into R

The files provided by the UN are Excel files, each file contains several tables nested as worksheets. To be able to read a specific table, I downloaded the file, then save as the file as a csv, each table into individual file (the folders are included in GihHub repo). After that I was able to read the file using readr library using read_csv(). I used skip = 16 argument inside read_csv to remove the worksheet header (United nation information, title,…etc).

We get in return 1,991 rows with 246 columns - wide table version. this dataset needs tidying, but to do so, we have to clean up first NA, unused columns, from char to dbl types,…etc.

Remove Regions and keep only Countries

First thing you would notice is that the first 8 rows are not countries, they are regions. This time we want to see how many people are migrating from countries to countries so we can remove these rows for the regions. When you look at ‘X6’ column, it looks that those ‘region’ rows don’t have any value there.

By running the command like below to keep only the rows whose X6 column have NA

These are all regions, not countries, which means that we can safely remove these rows by adding an exclamation mark ‘!’ right before ‘is.na()’ function like below.

## [1] 1624  246

Now we got a dataframe of 1,624 rows with 246 columns as return with all countries only data per year interval.

Remove unnecessary columns

When you look at the columns we would notice that there are unnecessary columns like ‘Total’, ‘Other South’, etc, because we are interested in estimates of the migrants only for countries to countries. We can remove those unnecessary columns with ‘select’ command along with other unnecessary columns like below. I’m using minus ‘-’ to delete columns and using ‘start_with’ function inside ‘select’ command to delete multiple columns whose names matche the text pattern of “Other”.

## [1] 1624  239

We now have 239 columns with 1,624 rows in total as a dataframe with only country to country data.

Rename Columns

Gather 234 columns to make it tidy

Now, it’s ready to tidy this ‘matrix’-ish data form by using ‘gather’ command from tidyr package. I will gather all origin countries in one column with a corresponding migrants number. Basically, we will gather from column number 4 to column 234 into two columns origin_country and migrants. I used colnames() to get the exact columns number

##   [1] "year"                              
##   [2] "destination_country"               
##   [3] "Afghanistan"                       
##   [4] "Albania"                           
##   [5] "Algeria"                           
##   [6] "American Samoa"                    
##   [7] "Andorra"                           
##   [8] "Angola"                            
##   [9] "Anguilla"                          
##  [10] "Antigua and Barbuda"               
##  [11] "Argentina"                         
##  [12] "Armenia"                           
##  [13] "Aruba"                             
##  [14] "Australia"                         
##  [15] "Austria"                           
##  [16] "Azerbaijan"                        
##  [17] "Bahamas"                           
##  [18] "Bahrain"                           
##  [19] "Bangladesh"                        
##  [20] "Barbados"                          
##  [21] "Belarus"                           
##  [22] "Belgium"                           
##  [23] "Belize"                            
##  [24] "Benin"                             
##  [25] "Bermuda"                           
##  [26] "Bhutan"                            
##  [27] "Bolivia (Plurinational State of)"  
##  [28] "Bonaire, Sint Eustatius and Saba"  
##  [29] "Bosnia and Herzegovina"            
##  [30] "Botswana"                          
##  [31] "Brazil"                            
##  [32] "British Virgin Islands"            
##  [33] "Brunei Darussalam"                 
##  [34] "Bulgaria"                          
##  [35] "Burkina Faso"                      
##  [36] "Burundi"                           
##  [37] "Cabo Verde"                        
##  [38] "Cambodia"                          
##  [39] "Cameroon"                          
##  [40] "Canada"                            
##  [41] "Cayman Islands"                    
##  [42] "Central African Republic"          
##  [43] "Chad"                              
##  [44] "Channel Islands"                   
##  [45] "Chile"                             
##  [46] "China"                             
##  [47] "China, Hong Kong SAR"              
##  [48] "China, Macao SAR"                  
##  [49] "Colombia"                          
##  [50] "Comoros"                           
##  [51] "Congo"                             
##  [52] "Cook Islands"                      
##  [53] "Costa Rica"                        
##  [54] "Côte d'Ivoire"                     
##  [55] "Croatia"                           
##  [56] "Cuba"                              
##  [57] "Curaçao"                           
##  [58] "Cyprus"                            
##  [59] "Czechia"                           
##  [60] "Dem. People's Republic of Korea"   
##  [61] "Democratic Republic of the Congo"  
##  [62] "Denmark"                           
##  [63] "Djibouti"                          
##  [64] "Dominica"                          
##  [65] "Dominican Republic"                
##  [66] "Ecuador"                           
##  [67] "Egypt"                             
##  [68] "El Salvador"                       
##  [69] "Equatorial Guinea"                 
##  [70] "Eritrea"                           
##  [71] "Estonia"                           
##  [72] "Eswatini"                          
##  [73] "Ethiopia"                          
##  [74] "Falkland Islands (Malvinas)"       
##  [75] "Faroe Islands"                     
##  [76] "Fiji"                              
##  [77] "Finland"                           
##  [78] "France"                            
##  [79] "French Guiana"                     
##  [80] "French Polynesia"                  
##  [81] "Gabon"                             
##  [82] "Gambia"                            
##  [83] "Georgia"                           
##  [84] "Germany"                           
##  [85] "Ghana"                             
##  [86] "Gibraltar"                         
##  [87] "Greece"                            
##  [88] "Greenland"                         
##  [89] "Grenada"                           
##  [90] "Guadeloupe"                        
##  [91] "Guam"                              
##  [92] "Guatemala"                         
##  [93] "Guinea"                            
##  [94] "Guinea-Bissau"                     
##  [95] "Guyana"                            
##  [96] "Haiti"                             
##  [97] "Holy See"                          
##  [98] "Honduras"                          
##  [99] "Hungary"                           
## [100] "Iceland"                           
## [101] "India"                             
## [102] "Indonesia"                         
## [103] "Iran (Islamic Republic of)"        
## [104] "Iraq"                              
## [105] "Ireland"                           
## [106] "Isle of Man"                       
## [107] "Israel"                            
## [108] "Italy"                             
## [109] "Jamaica"                           
## [110] "Japan"                             
## [111] "Jordan"                            
## [112] "Kazakhstan"                        
## [113] "Kenya"                             
## [114] "Kiribati"                          
## [115] "Kuwait"                            
## [116] "Kyrgyzstan"                        
## [117] "Lao People's Democratic Republic"  
## [118] "Latvia"                            
## [119] "Lebanon"                           
## [120] "Lesotho"                           
## [121] "Liberia"                           
## [122] "Libya"                             
## [123] "Liechtenstein"                     
## [124] "Lithuania"                         
## [125] "Luxembourg"                        
## [126] "Madagascar"                        
## [127] "Malawi"                            
## [128] "Malaysia"                          
## [129] "Maldives"                          
## [130] "Mali"                              
## [131] "Malta"                             
## [132] "Marshall Islands"                  
## [133] "Martinique"                        
## [134] "Mauritania"                        
## [135] "Mauritius"                         
## [136] "Mayotte"                           
## [137] "Mexico"                            
## [138] "Micronesia (Fed. States of)"       
## [139] "Monaco"                            
## [140] "Mongolia"                          
## [141] "Montenegro"                        
## [142] "Montserrat"                        
## [143] "Morocco"                           
## [144] "Mozambique"                        
## [145] "Myanmar"                           
## [146] "Namibia"                           
## [147] "Nauru"                             
## [148] "Nepal"                             
## [149] "Netherlands"                       
## [150] "New Caledonia"                     
## [151] "New Zealand"                       
## [152] "Nicaragua"                         
## [153] "Niger"                             
## [154] "Nigeria"                           
## [155] "Niue"                              
## [156] "North Macedonia"                   
## [157] "Northern Mariana Islands"          
## [158] "Norway"                            
## [159] "Oman"                              
## [160] "Pakistan"                          
## [161] "Palau"                             
## [162] "Panama"                            
## [163] "Papua New Guinea"                  
## [164] "Paraguay"                          
## [165] "Peru"                              
## [166] "Philippines"                       
## [167] "Poland"                            
## [168] "Portugal"                          
## [169] "Puerto Rico"                       
## [170] "Qatar"                             
## [171] "Republic of Korea"                 
## [172] "Republic of Moldova"               
## [173] "Réunion"                           
## [174] "Romania"                           
## [175] "Russian Federation"                
## [176] "Rwanda"                            
## [177] "Saint Helena"                      
## [178] "Saint Kitts and Nevis"             
## [179] "Saint Lucia"                       
## [180] "Saint Pierre and Miquelon"         
## [181] "Saint Vincent and the Grenadines"  
## [182] "Samoa"                             
## [183] "San Marino"                        
## [184] "Sao Tome and Principe"             
## [185] "Saudi Arabia"                      
## [186] "Senegal"                           
## [187] "Serbia"                            
## [188] "Seychelles"                        
## [189] "Sierra Leone"                      
## [190] "Singapore"                         
## [191] "Sint Maarten (Dutch part)"         
## [192] "Slovakia"                          
## [193] "Slovenia"                          
## [194] "Solomon Islands"                   
## [195] "Somalia"                           
## [196] "South Africa"                      
## [197] "South Sudan"                       
## [198] "Spain"                             
## [199] "Sri Lanka"                         
## [200] "State of Palestine"                
## [201] "Sudan"                             
## [202] "Suriname"                          
## [203] "Sweden"                            
## [204] "Switzerland"                       
## [205] "Syrian Arab Republic"              
## [206] "Tajikistan"                        
## [207] "Thailand"                          
## [208] "Timor-Leste"                       
## [209] "Togo"                              
## [210] "Tokelau"                           
## [211] "Tonga"                             
## [212] "Trinidad and Tobago"               
## [213] "Tunisia"                           
## [214] "Turkey"                            
## [215] "Turkmenistan"                      
## [216] "Turks and Caicos Islands"          
## [217] "Tuvalu"                            
## [218] "Uganda"                            
## [219] "Ukraine"                           
## [220] "United Arab Emirates"              
## [221] "United Kingdom"                    
## [222] "United Republic of Tanzania"       
## [223] "United States of America"          
## [224] "United States Virgin Islands"      
## [225] "Uruguay"                           
## [226] "Uzbekistan"                        
## [227] "Vanuatu"                           
## [228] "Venezuela (Bolivarian Republic of)"
## [229] "Viet Nam"                          
## [230] "Wallis and Futuna Islands"         
## [231] "Western Sahara"                    
## [232] "Yemen"                             
## [233] "Zambia"                            
## [234] "Zimbabwe"                          
## [235] "X242"                              
## [236] "X243"                              
## [237] "X244"                              
## [238] "X245"                              
## [239] "X246"
## [1] 78754     4

Now we condensed our data into long version - only 4 columns. However, migrants column is a character type, we need to convert it into double type. I used the parse_number() method inside the readr package to do the job for me.

Perform Statistics

Here we want to know which countries has the highest number of migrant by year. To do so, I would use group_by from tidyr then get the summation of all migrants per destination country.

## Classes 'tbl_df', 'tbl' and 'data.frame':    78754 obs. of  4 variables:
##  $ year               : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ destination_country: chr  "Namibia" "South Africa" "Egypt" "Libya" ...
##  $ origin_country     : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ migrants           : num  64 59 237 677 254 ...
##   ..- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame':  345 obs. of  4 variables:
##   .. ..$ row     : int  496 546 597 648 1397 1419 1589 1644 1938 1956 ...
##   .. ..$ col     : int  NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ expected: chr  "a number" "a number" "a number" "a number" ...
##   .. ..$ actual  : chr  "-" "-" "-" "-" ...

As we can see here, we got more than 78,700 colums. I would prefer to take a subset of the data based on some statistical analysis. To do so, we can get mean, median to set a filtering criteria. Additionally, I will add a ranking column to the dataframe to rank the countries

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        0       39      280    17319     2342 12168662      345

Now we get a much tider data about 7,000 columns, only 4 columns. However, we can use spread() function to investigate the number of migrants for each year interval.

Test that the file is already created.

Please be sure that you set working directory in Rstudio to the current working directory.

## [1] TRUE

open the file from local machine

Read from GitHub

We can open the file from the GitHub

Visualization

I would prefer to explore the data quickly using the world map. I used ggmap r package to draw world map, then I recoded the country name in the world map - only for the United States of America to be US and United Kingdom to be UK. Finally, I did a left join by region = country, by this, I was able to cast world map to fit my data.

## [1] "Afghanistan" "Albania"     "Algeria"     "Andorra"     "Angola"     
## [6] "Argentina"

As we can see from the plot, The top destination countries that migrant targeting is the United states of America in 2019 with more than 9M migrants. However, we need to get more insights into the last three year intervals, 2010, 2015, 2019.

We got some interesting results here, for instance, United states of America has the highest migrants population over the three time intervals. However, the total migrants that USA accepts decreased by 1 million migrants in 2019. Turkey started accepting migrants in 2019 time interval with a significant number more than 3.5M migrants per interval. As an overall observation, more countries accepting migrants throught time where 2019 interval has the the top 6 countries accepting population.

For the next steps, I am interested in investigating what are the migrants’ origin countries that migrate to the United States and Turkey.

##  Named list()
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

The majority of migrant population to the United States of America is from Mexico. Also, it seams that the Turkey accepted more than 3.7M migrant from Syrian Arab Republic in 2019 interval. The other main origin countries that have most of the population are India, Ukrain, and Russian Federation respectively with no major difference in the total population.

As illustrated, we can concluded that United States is the top destination for Mexicans. However, Indians like to live in the United Arab Emirates at the same time Syrians would rather to flee to Turkey

Dataset_2: UN_Migrant stock by age

Import the dataset

As the previous dataset, we need to extract only the countries not regions. So, we will do the same steps as before

We now have 1,624 row with 62 columns. Now it is time to do some cleaning up.

Remove unnecessary columns

Now we have 1624 rows with a 34 columns

Subset the data - Age categories for both sexes

Since we are focusing here on what is the most age categories that had migrated from their origin countries to another countries. So I decided to subset the data into three subsets, both categories, males only, and females only. We are interested here in both categories.

Rename the columns

Gathering columns

Now we have 18 columns, we need to convert the column type from char to num to be able to complete the summation. Second, use gather() make the data set into long format.

Test that the file is already created.

Please be sure that you set working directory in Rstudio to the current working directory.

## [1] TRUE

open the file from local machine

Read from GitHub

We can open the file from the GitHub

Visualization

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    1227    6068   55709   30883 5485048     480

I used dplyr::leftjoin -> Reduce function along with merge to combine all 4 dataframes into one.

Now based on the previous analysis in Dataset_1, we will pick only the top countries that have majority of population to get more insights how many migrants for each age category. I will look into those the top 6 destinations which are USA, Turkey, United Arab Emirates, India, Ukraine, and Russian Federation.

First, we need to condense the age groups into categories to be easy to visualize. To do that, we can construct a for-loop to iterate over the merged_age dataframe and change the value into the category we want child, youth, adult, and senior migrants as the table below:

child | 0 - 19

youth | 20 - 39

adult | 40 - 59

senior | 60 - 75+

The dataframe now is categorized into 4 age categories; however, the dataframe has NA values need to be replaced by zero.

Here, I wanted to get the age distribution for the best 3 countries to follow the pattern in the first dataset. So I filtered the destination_country and year to get the desired data.

Dataset_3: UN_Migrant_refugee stock total 2019

Import the dataset

We got 284 rows with 25 columns. + get only countries not regions

## [1] 232  25

Remove unnecessary columns

## [1] 232  21

Now we have 232 rows with a 21 columns

Subset the data - Estimate refugee stock including asylum seekers (both sexes)

## [1] 232   8

Rename the columns

Pasring years columns into doubles using parse_number() from readr package

Gathering columns

Make the data more tidy by gather total_estimates by year in 2 columns instead of 7 columns - long format.

## [1] 1624    3

Test that the file is already created.

Please be sure that you set working directory in Rstudio to the current working directory.

## [1] TRUE

open the file from local machine

Read from GitHub

We can open the file from the GitHub

Visualization

To follow the same pattern of analysis, I will filter the destination countries to get the top 4 countries in the last 3 year intervals.

From 2010 to 2019, a significant increase of refugess was estimated in Turkey. Specifically, the refugess estimates increased from 10,000 to 3.7 million, which was the largest number of refugees ever predicted in all countries. Similarly, but with a much lower rate, the number of refugees in the United States increased from 270,000 to 929,762. In contrast, the estimated refugees in the United Arab Emirates was the smallest (<3000).

Conclusions

The analysis gave a basic idea on average migration populations on different areas in the world and how many more migrants these destinations would accept. I would like to extend the study to investigate what factors influence migrants to leave their country and target those destinations, it it life quality, employment opportunities,..etc. Also, I am curious to know what level of education migrant have are they high educational migrant, high skilled technicians,..etc.