UN- International Migration Stock Analysis

Abstract

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: +Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include
- narrative descriptions of your data cleanup work, analysis, and conclusions.
- Please include in your homework submission, for each of the three chosen datasets:
The URL to the .Rmd file in your GitHub repository, and
The URL for your rpubs.com web page.

Introduction

The United Nations, Department of Economic and Social Affairs published dataset presents estimates of international migrant by age, sex and origin. Estimates are presented for 1990, 1995, 2000, 2005, 2010, 2015 and 2019 and are available for all countries and areas of the world. The estimates are based on official statistics on the foreign-born or the foreign population.

Total international migrant stock By destination and origin - Table - 1
Total international migrants by age - Table - 1
Total international migrant estimates - Table - 6

Questions

I decided to get more insight about the data provided by the United Nation for the international migrant stock on the last decade. Some of the questions I want to invistigate is:

Which countries are the best destination for migrants.
What are the top migrants country of origin?
What is the age categories for the migrants? adult, children, or seniors
How many refugees top cdestination countries can accept per time interval?

Libraries used

library(readr)
library(dplyr)
library(janitor)
library(tidyr)
library(ggmap)
library(ggplot2)
library(RCurl)

Dataset_1: International migrant stock By destination and origin

I will start by investigating on the total number of migrants by destination and origin country. First, we read the dataset into R.

Read the data into R

The files provided by the UN are Excel files, each file contains several tables nested as worksheets. To be able to read a specific table, I downloaded the file, then save as the file as a csv, each table into individual file (the folders are included in GihHub repo). After that I was able to read the file using readr library using read_csv(). I used skip = 16 argument inside read_csv to remove the worksheet header (United nation information, title,…etc).

# read the file into R - use skip to remove 
UN_MigrantStock <- read_csv("UN_MigrantStockByOriginAndDestination_2019/Table 1-Table 1.csv", 
    skip = 16)
UN_MigrantStock

We get in return 1,991 rows with 246 columns - wide table version. this dataset needs tidying, but to do so, we have to clean up first NA, unused columns, from char to dbl types,…etc.

Remove Regions and keep only Countries

First thing you would notice is that the first 8 rows are not countries, they are regions. This time we want to see how many people are migrating from countries to countries so we can remove these rows for the regions. When you look at ‘X6’ column, it looks that those ‘region’ rows don’t have any value there.

By running the command like below to keep only the rows whose X6 column have NA

# Use filter() from tidyr lib to filter the dataset by only regions 
temp <- filter(UN_MigrantStock ,is.na(X6))
head(temp)

These are all regions, not countries, which means that we can safely remove these rows by adding an exclamation mark ‘!’ right before ‘is.na()’ function like below.

# adding ! to is.na() to get countries only 
UN_Countries <- filter(UN_MigrantStock ,!is.na(X6))
head(UN_Countries)

dim(UN_Countries)

## [1] 1624  246

Now we got a dataframe of 1,624 rows with 246 columns as return with all countries only data per year interval.

Remove unnecessary columns

When you look at the columns we would notice that there are unnecessary columns like ‘Total’, ‘Other South’, etc, because we are interested in estimates of the migrants only for countries to countries. We can remove those unnecessary columns with ‘select’ command along with other unnecessary columns like below. I’m using minus ‘-’ to delete columns and using ‘start_with’ function inside ‘select’ command to delete multiple columns whose names matche the text pattern of “Other”.

UN_Countries_df <- UN_Countries %>% 
  select(-X2, -X4, -X5, -X6, -Total, -starts_with("Other"))
head(UN_Countries_df)

dim(UN_Countries_df)

## [1] 1624  239

We now have 239 columns with 1,624 rows in total as a dataframe with only country to country data.

Rename Columns

UN_Countries_rename <- UN_Countries_df %>% 
  rename(
    year = X1, 
    destination_country = X3
    )
head(UN_Countries_rename)

Gather 234 columns to make it tidy

Now, it’s ready to tidy this ‘matrix’-ish data form by using ‘gather’ command from tidyr package. I will gather all origin countries in one column with a corresponding migrants number. Basically, we will gather from column number 4 to column 234 into two columns origin_country and migrants. I used colnames() to get the exact columns number

colnames(UN_Countries_rename)

##   [1] "year"                              
##   [2] "destination_country"               
##   [3] "Afghanistan"                       
##   [4] "Albania"                           
##   [5] "Algeria"                           
##   [6] "American Samoa"                    
##   [7] "Andorra"                           
##   [8] "Angola"                            
##   [9] "Anguilla"                          
##  [10] "Antigua and Barbuda"               
##  [11] "Argentina"                         
##  [12] "Armenia"                           
##  [13] "Aruba"                             
##  [14] "Australia"                         
##  [15] "Austria"                           
##  [16] "Azerbaijan"                        
##  [17] "Bahamas"                           
##  [18] "Bahrain"                           
##  [19] "Bangladesh"                        
##  [20] "Barbados"                          
##  [21] "Belarus"                           
##  [22] "Belgium"                           
##  [23] "Belize"                            
##  [24] "Benin"                             
##  [25] "Bermuda"                           
##  [26] "Bhutan"                            
##  [27] "Bolivia (Plurinational State of)"  
##  [28] "Bonaire, Sint Eustatius and Saba"  
##  [29] "Bosnia and Herzegovina"            
##  [30] "Botswana"                          
##  [31] "Brazil"                            
##  [32] "British Virgin Islands"            
##  [33] "Brunei Darussalam"                 
##  [34] "Bulgaria"                          
##  [35] "Burkina Faso"                      
##  [36] "Burundi"                           
##  [37] "Cabo Verde"                        
##  [38] "Cambodia"                          
##  [39] "Cameroon"                          
##  [40] "Canada"                            
##  [41] "Cayman Islands"                    
##  [42] "Central African Republic"          
##  [43] "Chad"                              
##  [44] "Channel Islands"                   
##  [45] "Chile"                             
##  [46] "China"                             
##  [47] "China, Hong Kong SAR"              
##  [48] "China, Macao SAR"                  
##  [49] "Colombia"                          
##  [50] "Comoros"                           
##  [51] "Congo"                             
##  [52] "Cook Islands"                      
##  [53] "Costa Rica"                        
##  [54] "Côte d'Ivoire"                     
##  [55] "Croatia"                           
##  [56] "Cuba"                              
##  [57] "Curaçao"                           
##  [58] "Cyprus"                            
##  [59] "Czechia"                           
##  [60] "Dem. People's Republic of Korea"   
##  [61] "Democratic Republic of the Congo"  
##  [62] "Denmark"                           
##  [63] "Djibouti"                          
##  [64] "Dominica"                          
##  [65] "Dominican Republic"                
##  [66] "Ecuador"                           
##  [67] "Egypt"                             
##  [68] "El Salvador"                       
##  [69] "Equatorial Guinea"                 
##  [70] "Eritrea"                           
##  [71] "Estonia"                           
##  [72] "Eswatini"                          
##  [73] "Ethiopia"                          
##  [74] "Falkland Islands (Malvinas)"       
##  [75] "Faroe Islands"                     
##  [76] "Fiji"                              
##  [77] "Finland"                           
##  [78] "France"                            
##  [79] "French Guiana"                     
##  [80] "French Polynesia"                  
##  [81] "Gabon"                             
##  [82] "Gambia"                            
##  [83] "Georgia"                           
##  [84] "Germany"                           
##  [85] "Ghana"                             
##  [86] "Gibraltar"                         
##  [87] "Greece"                            
##  [88] "Greenland"                         
##  [89] "Grenada"                           
##  [90] "Guadeloupe"                        
##  [91] "Guam"                              
##  [92] "Guatemala"                         
##  [93] "Guinea"                            
##  [94] "Guinea-Bissau"                     
##  [95] "Guyana"                            
##  [96] "Haiti"                             
##  [97] "Holy See"                          
##  [98] "Honduras"                          
##  [99] "Hungary"                           
## [100] "Iceland"                           
## [101] "India"                             
## [102] "Indonesia"                         
## [103] "Iran (Islamic Republic of)"        
## [104] "Iraq"                              
## [105] "Ireland"                           
## [106] "Isle of Man"                       
## [107] "Israel"                            
## [108] "Italy"                             
## [109] "Jamaica"                           
## [110] "Japan"                             
## [111] "Jordan"                            
## [112] "Kazakhstan"                        
## [113] "Kenya"                             
## [114] "Kiribati"                          
## [115] "Kuwait"                            
## [116] "Kyrgyzstan"                        
## [117] "Lao People's Democratic Republic"  
## [118] "Latvia"                            
## [119] "Lebanon"                           
## [120] "Lesotho"                           
## [121] "Liberia"                           
## [122] "Libya"                             
## [123] "Liechtenstein"                     
## [124] "Lithuania"                         
## [125] "Luxembourg"                        
## [126] "Madagascar"                        
## [127] "Malawi"                            
## [128] "Malaysia"                          
## [129] "Maldives"                          
## [130] "Mali"                              
## [131] "Malta"                             
## [132] "Marshall Islands"                  
## [133] "Martinique"                        
## [134] "Mauritania"                        
## [135] "Mauritius"                         
## [136] "Mayotte"                           
## [137] "Mexico"                            
## [138] "Micronesia (Fed. States of)"       
## [139] "Monaco"                            
## [140] "Mongolia"                          
## [141] "Montenegro"                        
## [142] "Montserrat"                        
## [143] "Morocco"                           
## [144] "Mozambique"                        
## [145] "Myanmar"                           
## [146] "Namibia"                           
## [147] "Nauru"                             
## [148] "Nepal"                             
## [149] "Netherlands"                       
## [150] "New Caledonia"                     
## [151] "New Zealand"                       
## [152] "Nicaragua"                         
## [153] "Niger"                             
## [154] "Nigeria"                           
## [155] "Niue"                              
## [156] "North Macedonia"                   
## [157] "Northern Mariana Islands"          
## [158] "Norway"                            
## [159] "Oman"                              
## [160] "Pakistan"                          
## [161] "Palau"                             
## [162] "Panama"                            
## [163] "Papua New Guinea"                  
## [164] "Paraguay"                          
## [165] "Peru"                              
## [166] "Philippines"                       
## [167] "Poland"                            
## [168] "Portugal"                          
## [169] "Puerto Rico"                       
## [170] "Qatar"                             
## [171] "Republic of Korea"                 
## [172] "Republic of Moldova"               
## [173] "Réunion"                           
## [174] "Romania"                           
## [175] "Russian Federation"                
## [176] "Rwanda"                            
## [177] "Saint Helena"                      
## [178] "Saint Kitts and Nevis"             
## [179] "Saint Lucia"                       
## [180] "Saint Pierre and Miquelon"         
## [181] "Saint Vincent and the Grenadines"  
## [182] "Samoa"                             
## [183] "San Marino"                        
## [184] "Sao Tome and Principe"             
## [185] "Saudi Arabia"                      
## [186] "Senegal"                           
## [187] "Serbia"                            
## [188] "Seychelles"                        
## [189] "Sierra Leone"                      
## [190] "Singapore"                         
## [191] "Sint Maarten (Dutch part)"         
## [192] "Slovakia"                          
## [193] "Slovenia"                          
## [194] "Solomon Islands"                   
## [195] "Somalia"                           
## [196] "South Africa"                      
## [197] "South Sudan"                       
## [198] "Spain"                             
## [199] "Sri Lanka"                         
## [200] "State of Palestine"                
## [201] "Sudan"                             
## [202] "Suriname"                          
## [203] "Sweden"                            
## [204] "Switzerland"                       
## [205] "Syrian Arab Republic"              
## [206] "Tajikistan"                        
## [207] "Thailand"                          
## [208] "Timor-Leste"                       
## [209] "Togo"                              
## [210] "Tokelau"                           
## [211] "Tonga"                             
## [212] "Trinidad and Tobago"               
## [213] "Tunisia"                           
## [214] "Turkey"                            
## [215] "Turkmenistan"                      
## [216] "Turks and Caicos Islands"          
## [217] "Tuvalu"                            
## [218] "Uganda"                            
## [219] "Ukraine"                           
## [220] "United Arab Emirates"              
## [221] "United Kingdom"                    
## [222] "United Republic of Tanzania"       
## [223] "United States of America"          
## [224] "United States Virgin Islands"      
## [225] "Uruguay"                           
## [226] "Uzbekistan"                        
## [227] "Vanuatu"                           
## [228] "Venezuela (Bolivarian Republic of)"
## [229] "Viet Nam"                          
## [230] "Wallis and Futuna Islands"         
## [231] "Western Sahara"                    
## [232] "Yemen"                             
## [233] "Zambia"                            
## [234] "Zimbabwe"                          
## [235] "X242"                              
## [236] "X243"                              
## [237] "X244"                              
## [238] "X245"                              
## [239] "X246"

Countries_by_no <- gather(UN_Countries_rename, "origin_country", "migrants", 3:234, na.rm = TRUE)
head(Countries_by_no)

# remove extra columns generated 
clean_country_df <- Countries_by_no[, -c(3:7)]
dim(clean_country_df)

## [1] 78754     4

head(clean_country_df)

Now we condensed our data into long version - only 4 columns. However, migrants column is a character type, we need to convert it into double type. I used the parse_number() method inside the readr package to do the job for me.

Perform Statistics

Here we want to know which countries has the highest number of migrant by year. To do so, I would use group_by from tidyr then get the summation of all migrants per destination country.

# change the migrants variable into num type
num_country_df <- clean_country_df
num_country_df$migrants <- parse_number(num_country_df$migrants)
str(num_country_df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    78754 obs. of  4 variables:
##  $ year               : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ destination_country: chr  "Namibia" "South Africa" "Egypt" "Libya" ...
##  $ origin_country     : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ migrants           : num  64 59 237 677 254 ...
##   ..- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame':  345 obs. of  4 variables:
##   .. ..$ row     : int  496 546 597 648 1397 1419 1589 1644 1938 1956 ...
##   .. ..$ col     : int  NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ expected: chr  "a number" "a number" "a number" "a number" ...
##   .. ..$ actual  : chr  "-" "-" "-" "-" ...

head(num_country_df)

# group_by year, then sum the migrants per destination country
by_dest_Country_df <- num_country_df %>%
  group_by(year, destination_country, origin_country) %>%
  summarise(total_migrants = sum(migrants)) %>%
  # finally arrange them from big to small
  arrange(desc(total_migrants))

by_dest_Country_df ## this is the dataframe that will be exported

As we can see here, we got more than 78,700 colums. I would prefer to take a subset of the data based on some statistical analysis. To do so, we can get mean, median to set a filtering criteria. Additionally, I will add a ranking column to the dataframe to rank the countries

# first get mean, median, max, min 
summary(by_dest_Country_df$total_migrants) # 17319 -> take 20,000

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        0       39      280    17319     2342 12168662      345

# take a subset by filtering the results to only pick total migrants > mean

sub_set_migrant <- select(filter(by_dest_Country_df, total_migrants > 20000), c(1:4))

# add ranking system desc. ordered
sub_set_migrant$Rank <-  order(sub_set_migrant$total_migrants, decreasing = TRUE)

sub_set_migrant

Now we get a much tider data about 7,000 columns, only 4 columns. However, we can use spread() function to investigate the number of migrants for each year interval.

gather_by_year <- sub_set_migrant %>%
  spread(key = "year", value = "total_migrants")%>%
  arrange(Rank)

# replace NA with 0 value
gather_by_year[, 4:10][is.na(gather_by_year[, 4:10])] <- 0
gather_by_year

Export to csv file

by_destination_df <- write.table(by_dest_Country_df, file = "migrantByOriginDestination.csv", row.names = FALSE, na="", col.names = TRUE, sep = ",")

Test that the file is already created.

Please be sure that you set working directory in Rstudio to the current working directory.

file_test("-f", "~/Desktop/MSDS_2019/Fall2019/aquisition_management_607/week_6/project_2/migrantByOriginDestination.csv")

## [1] TRUE

open the file from local machine

dest_origin_df <- read.csv(file = "migrantByOriginDestination.csv", header = TRUE, sep = ",")

dest_origin_df

Read from GitHub

We can open the file from the GitHub

url <- 'https://raw.githubusercontent.com/salma71/MSDS_2019/master/Fall2019/aquisition_management_607/week_6/project_2/migrantByOriginDestination.csv'

# reading the url as a dataframe
by_origin_destination <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
head(by_origin_destination)

Visualization

I would prefer to explore the data quickly using the world map. I used ggmap r package to draw world map, then I recoded the country name in the world map - only for the United States of America to be US and United Kingdom to be UK. Finally, I did a left join by region = country, by this, I was able to cast world map to fit my data.

map_world <- map_data("world")
#map_world$region
subset_migrant <- sub_set_migrant
#recode the country name to be able to left join the two data frames
head(as.factor(subset_migrant$destination_country) %>% 
  levels())

## [1] "Afghanistan" "Albania"     "Algeria"     "Andorra"     "Angola"     
## [6] "Argentina"

# RECODE NAMES
subset_migrant$destination_country <- recode(subset_migrant$destination_country
                                  ,'United States of America' = 'USA'
                                  ,'United Kingdom' = 'UK'
                                  )
#rename the column to be the same as world_map, so destination_country becomes country
colnames(subset_migrant)[2] <- "country"
#View(new_dat)
# LEFT JOIN -- we join region from map into country in subset_migrant.
map_world_joined <- left_join(map_world, subset_migrant, by = c('region' = 'country'))

mapp <- map_world_joined %>% 
  mutate(fill_flg = ifelse(is.na(year),F,T)) %>%
  filter(year == "2019")

ggplot() +
  geom_polygon(data = mapp, aes(x = long, y = lat, group = group, fill = total_migrants)) +
  labs(title = 'Countries with highest "Migrant Population in 2019"') +
  geom_point() +
  theme(text = element_text(family = "Gill Sans")
        ,plot.title = element_text(size = 15)
        ,plot.subtitle = element_text(size = 10)
        ,legend.text = element_text(size = 10)
)

As we can see from the plot, The top destination countries that migrant targeting is the United states of America in 2019 with more than 9M migrants. However, we need to get more insights into the last three year intervals, 2010, 2015, 2019.

# filtering the number of migrants that is more than 3M in 2010, 2015, and 2019
migration_15_19 = sub_set_migrant %>%
  filter(total_migrants >= 3000000 & year %in% c(2010 ,2015, 2019)) %>%
  select(destination_country, origin_country, year, total_migrants)%>%
  arrange(desc(total_migrants))

migration_15_19

ggplot(migration_15_19, aes(x=destination_country, y=total_migrants, fill = destination_country)) + 
  geom_bar(stat="identity", width=1, position = position_dodge()) + 
  labs(fill="destination country") +
  labs(title="Migrant population bar chart", 
       subtitle="Top destinations for migrats population - more than 3M migrants per year") +
  theme_bw() + 
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  facet_grid( ~ year)

We got some interesting results here, for instance, United states of America has the highest migrants population over the three time intervals. However, the total migrants that USA accepts decreased by 1 million migrants in 2019. Turkey started accepting migrants in 2019 time interval with a significant number more than 3.5M migrants per interval. As an overall observation, more countries accepting migrants throught time where 2019 interval has the the top 6 countries accepting population.

For the next steps, I am interested in investigating what are the migrants’ origin countries that migrate to the United States and Turkey.

by_origin_country <- migration_15_19 %>%
  filter(destination_country %in% c("United States of America", "United Arab Emirates", "Ukraine", "Turkey", "Russian Federation", "India") & total_migrants >= 3000000) %>%
  select(destination_country, origin_country, total_migrants, year) 
by_origin_country

theme()

##  Named list()
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

ggplot(by_origin_country, aes(x = destination_country , y = total_migrants, colour = origin_country)) +
  geom_point(size = 3) + 
  facet_grid(~ by_origin_country$year) + 
  labs(y = "total_migrants") +
  scale_y_continuous() +
  theme(axis.text.x = element_text(angle=65, vjust=0.65)) +
  theme_linedraw()

The majority of migrant population to the United States of America is from Mexico. Also, it seams that the Turkey accepted more than 3.7M migrant from Syrian Arab Republic in 2019 interval. The other main origin countries that have most of the population are India, Ukrain, and Russian Federation respectively with no major difference in the total population.

only_3 <- migration_15_19%>%
  filter(destination_country %in% c("United States of America", "United Arab Emirates", "Turkey") & year == "2019") %>%
  select(destination_country, origin_country, total_migrants) 

only_3

As illustrated, we can concluded that United States is the top destination for Mexicans. However, Indians like to live in the United Arab Emirates at the same time Syrians would rather to flee to Turkey

Dataset_2: UN_Migrant stock by age

Import the dataset

UN_Migrant_age <- read_csv("UN_MigrantStockByAgeAndSex_2019/Table 1-Table 1.csv", skip = 15)
UN_Migrant_age

As the previous dataset, we need to extract only the countries not regions. So, we will do the same steps as before

UN_Countries_age <- filter(UN_Migrant_age ,!is.na(X6))
UN_Countries_age

We now have 1,624 row with 62 columns. Now it is time to do some cleaning up.

Remove unnecessary columns

migrant_age_df <- UN_Countries_age %>% 
  select(-X2, -X4, -X5, -X6, -starts_with("Total"), -c(58:62))
migrant_age_df

Now we have 1624 rows with a 34 columns

Subset the data - Age categories for both sexes

Since we are focusing here on what is the most age categories that had migrated from their origin countries to another countries. So I decided to subset the data into three subsets, both categories, males only, and females only. We are interested here in both categories.

#Use select function to fetch only the data that we interested by picking the colnames
both_sex_df <- migrant_age_df %>%
  select(c("X1":"75+"))

both_sex_df

Rename the columns

both_sex_rename <- both_sex_df %>% 
  rename(
    year = X1, 
    destination_country = X3, 
    )
both_sex_rename

Gathering columns

Now we have 18 columns, we need to convert the column type from char to num to be able to complete the summation. Second, use gather() make the data set into long format.

num_age_df <- both_sex_rename
num_age_df$`0-4` <- parse_number(num_age_df$`0-4`)
num_age_df$`5-9` <- parse_number(num_age_df$`5-9`)
num_age_df$`10-14` <- parse_number(num_age_df$`10-14`)
num_age_df$`15-19` <- parse_number(num_age_df$`15-19`)
num_age_df$`20-24` <- parse_number(num_age_df$`20-24`)
num_age_df$`25-29` <- parse_number(num_age_df$`25-29`)
num_age_df$`30-34` <- parse_number(num_age_df$`30-34`)
num_age_df$`35-39` <- parse_number(num_age_df$`35-39`)
num_age_df$`40-44` <- parse_number(num_age_df$`40-44`)
num_age_df$`45-49` <- parse_number(num_age_df$`45-49`)
num_age_df$`50-54` <- parse_number(num_age_df$`50-54`)
num_age_df$`55-59` <- parse_number(num_age_df$`55-59`)
num_age_df$`60-64` <- parse_number(num_age_df$`60-64`)
num_age_df$`65-69` <- parse_number(num_age_df$`65-69`)
num_age_df$`70-74` <- parse_number(num_age_df$`70-74`)
num_age_df$`75+` <- parse_number(num_age_df$`75+`)

num_age_df

#age_category <- both_sex_rename %>%
#  summarise(child = sum(c(3:6)))

# gather total population by age category 
by_age <- num_age_df %>%
  gather(age, total_population, c(3:18))
by_age

Export to csv file

migrant_by_age <- write.table(by_age, file = "migrantByAge.csv", row.names = FALSE, na="", col.names = TRUE, sep = ",")

Test that the file is already created.

Please be sure that you set working directory in Rstudio to the current working directory.

file_test("-f", "~/Desktop/MSDS_2019/Fall2019/aquisition_management_607/week_6/project_2/migrantByAge.csv")

## [1] TRUE

open the file from local machine

migrant_byage_df <- read.csv(file = "migrantByAge.csv", header = TRUE, sep = ",")

migrant_byage_df

Read from GitHub

We can open the file from the GitHub

url <- 'https://raw.githubusercontent.com/salma71/MSDS_2019/master/Fall2019/aquisition_management_607/week_6/project_2/migrantByAge.csv'

# reading the url as a dataframe
by_age_bothsex <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
head(by_age_bothsex)

Visualization

#check how many youth 
summary(by_age$total_population)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    1227    6068   55709   30883 5485048     480

by_youth <- by_age%>%
  filter(age %in% c("20-24", "25-29", "30-34", "35-39") & total_population >= 55709)%>%
  select(year, destination_country, total_population)%>%
  arrange(desc(total_population))

by_youth

theme_set(theme_classic())
ggplot(by_youth, aes(x=year, y=total_population)) + 
  geom_point(col="tomato2", size=3) +   # Draw points
  geom_segment(aes(x=year, 
                   xend=year, 
                   y=min(total_population), 
                   yend=max(total_population)), 
               linetype="dashed", 
               size=0.1) +   # Draw dashed lines
  labs(title="Dot Plot", 
       subtitle="Youth population over the years", 
       caption="source: by_youth") +  
  coord_flip()

#check how many children 

by_child <- by_age%>%
  filter(age %in% c("0-4", "5-9", "10-14", "15-19") & total_population >= 55709)%>%
  select(year, destination_country, total_population, age)%>%
  arrange(desc(total_population))

by_child

#check how many adults 

by_adult <- by_age%>%
  filter(age %in% c("40-44", "45-49", "50-54", "55-59") & total_population >= 55709)%>%
  select(year, destination_country, total_population, age)%>%
  arrange(desc(total_population))

by_adult

#check how many senior 

by_senior <- by_age%>%
  filter(age %in% c("60-64", "65-69", "70-74", "75+") & total_population >= 55709)%>%
  select(year, destination_country, total_population, age)%>%
  arrange(desc(total_population))

by_senior

I used dplyr::leftjoin -> Reduce function along with merge to combine all 4 dataframes into one.

merged_age <- Reduce(function(...) merge(..., all=TRUE, sort = FALSE), list(by_child, by_youth, by_adult, by_senior))
merged_age

Now based on the previous analysis in Dataset_1, we will pick only the top countries that have majority of population to get more insights how many migrants for each age category. I will look into those the top 6 destinations which are USA, Turkey, United Arab Emirates, India, Ukraine, and Russian Federation.

First, we need to condense the age groups into categories to be easy to visualize. To do that, we can construct a for-loop to iterate over the merged_age dataframe and change the value into the category we want child, youth, adult, and senior migrants as the table below:

child | 0 - 19

youth | 20 - 39

adult | 40 - 59

senior | 60 - 75+

for (i in 1:nrow(merged_age)) {
    if(merged_age[[i,4]] %in% c("0-4" ,  "5-9" ,  "10-14" ,"15-19")) {
      merged_age[[i,4]] <- "child"
    } else if(merged_age[[i,4]] %in% c("20-24", "25-29" ,"30-34" ,"35-39")) {
      merged_age[[i,4]] <- "youth"
    } else if(merged_age[[i,4]] %in% c("40-44" ,"45-49" ,"50-54" ,"55-59")) {
      merged_age[[i,4]] <- "adult"
    } else if(merged_age[[i,4]] %in% c("60-64", "65-69", "70-74", "75+")) {
      merged_age[[i,4]] <- "senior"
    }
  merged_age
}
merged_age

The dataframe now is categorized into 4 age categories; however, the dataframe has NA values need to be replaced by zero.

#replace NA <- 0
merged_age[is.na(merged_age)] <- 0

# filter the results to get only the non-zero values, 3 years intervals, and 5 top countries desc. arranged.
top_migrant_countries <- merged_age %>%
  filter(destination_country %in% c("United States of America", "United Arab Emirates", "Ukraine", "Turkey", "Russian Federation", "India") & year %in% c(2010, 2015,2019) & age != 0) %>%
  select(destination_country, age, total_population, year)%>%
  arrange(desc(total_population))
  
top_migrant_countries

ggplot(top_migrant_countries, aes(x=age, y=total_population, fill = destination_country)) + 
  geom_bar(stat="identity",position=position_dodge()) + 
  labs(title="Migrant age-categories during 3 intervals", 
       subtitle="Top migrats age-categories for the top 6 destination countries") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  facet_wrap(~year) +
  theme_light()

Here, I wanted to get the age distribution for the best 3 countries to follow the pattern in the first dataset. So I filtered the destination_country and year to get the desired data.

best_3 <- merged_age%>%
  filter(destination_country %in% c("United States of America", "United Arab Emirates", "Turkey") & year %in% c("2019") & age != 0) %>%
  select(destination_country, age, total_population, year)%>%
  arrange(desc(total_population))

best_3

ggplot(best_3, aes(x=age, y=total_population, group = age)) + 
  geom_col(aes(fill = age), position = "dodge", stat = "identity") + 
  labs(title="Migrant age-categories distribution in 2019", 
       subtitle="Top migrats age-categories for the top 3 destination countries") + 
  
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  facet_wrap(~ destination_country) +
  theme_light()

Dataset_3: UN_Migrant_refugee stock total 2019

Import the dataset

library(readr)
UN_Refugee_df <- read_csv("UN_MigrantStockTotal_2019/Table 6-Table 1.csv", 
    skip = 15)
UN_Refugee_df

We got 284 rows with 25 columns. + get only countries not regions

refugee_country_df <- filter(UN_Refugee_df ,!is.na(X5))
dim(refugee_country_df)

## [1] 232  25

refugee_country_df

Remove unnecessary columns

# clean up extra columns
refugee_df <- refugee_country_df %>% 
  select(-X1,-c(3:5))
dim(refugee_df)

## [1] 232  21

refugee_df

Now we have 232 rows with a 21 columns

Subset the data - Estimate refugee stock including asylum seekers (both sexes)

#Use select function to fetch only the data that we interested by picking the colnames - both males, and females
estimate_refugee_df <- refugee_df %>%
  select(c("X2":"2019"))

dim(estimate_refugee_df)

## [1] 232   8

estimate_refugee_df

Rename the columns

estimate_refugee_rename <- estimate_refugee_df %>% 
  rename(
    destination_country = X2, 
    )
estimate_refugee_rename

Pasring years columns into doubles using parse_number() from readr package

num_est_refugee <- estimate_refugee_rename
num_est_refugee$`1990` <- parse_number(num_est_refugee$`1990`)
num_est_refugee$`1995` <- parse_number(num_est_refugee$`1995`)
num_est_refugee$`2000` <- parse_number(num_est_refugee$`2000`)
num_est_refugee$`2005` <- parse_number(num_est_refugee$`2005`)
num_est_refugee$`2010` <- parse_number(num_est_refugee$`2010`)
num_est_refugee$`2015` <- parse_number(num_est_refugee$`2015`)
num_est_refugee$`2019` <- parse_number(num_est_refugee$`2019`)

num_est_refugee

Gathering columns

Make the data more tidy by gather total_estimates by year in 2 columns instead of 7 columns - long format.

# gather() by year into total_estimates - gather columns 2 (1990) : 8(2019) then arrange desc
by_refugee_pop <- num_est_refugee%>%
  gather(year, total_estimates, c(2:8)) %>%
  arrange(desc(total_estimates))
dim(by_refugee_pop)

## [1] 1624    3

by_refugee_pop

Export to csv file

Refugee_estimates <- write.table(by_refugee_pop, file = "refugeeEstimates.csv", row.names = FALSE, na="", col.names = TRUE, sep = ",")

Test that the file is already created.

Please be sure that you set working directory in Rstudio to the current working directory.

file_test("-f", "~/Desktop/MSDS_2019/Fall2019/aquisition_management_607/week_6/project_2/refugeeEstimates.csv")

## [1] TRUE

open the file from local machine

refugeeEstimates <- read.csv(file = "refugeeEstimates.csv", header = TRUE, sep = ",")

refugeeEstimates

Read from GitHub

We can open the file from the GitHub

url <- 'https://raw.githubusercontent.com/salma71/MSDS_2019/master/Fall2019/aquisition_management_607/week_6/project_2/refugeeEstimates.csv'

# reading the url as a dataframe
ref_estimates <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
head(ref_estimates)

Visualization

To follow the same pattern of analysis, I will filter the destination countries to get the top 4 countries in the last 3 year intervals.

# filter by year, destination, total estimates is not equal to 0 
by_summary <- by_refugee_pop%>%
  filter(destination_country %in% c("United States of America", "Turkey", "United Arab Emirates" ) & year %in% c(2010, 2015,2019) & total_estimates != 0)

by_summary

ggplot(by_summary, aes(x = destination_country, y = total_estimates, fill = destination_country)) +
    geom_bar(stat = "identity",
        width = 1) +
    geom_text(aes(label = total_estimates),
        hjust = .5,
        color = "black",
        size = 3.5) +
  labs(title="Refugee estimates for top 3 destinations", 
       subtitle="How much refugees the top three destinations can accept per year interval") + 
  theme_bw() +
    theme(
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) + 
  facet_wrap(~ year)

From 2010 to 2019, a significant increase of refugess was estimated in Turkey. Specifically, the refugess estimates increased from 10,000 to 3.7 million, which was the largest number of refugees ever predicted in all countries. Similarly, but with a much lower rate, the number of refugees in the United States increased from 270,000 to 929,762. In contrast, the estimated refugees in the United Arab Emirates was the smallest (<3000).

Conclusions

The analysis gave a basic idea on average migration populations on different areas in the world and how many more migrants these destinations would accept. I would like to extend the study to investigate what factors influence migrants to leave their country and target those destinations, it it life quality, employment opportunities,..etc. Also, I am curious to know what level of education migrant have are they high educational migrant, high skilled technicians,..etc.

UN- International Migration Stock Analysis

Salma Elshahawy

9/30/2019

Abstract

Introduction

Questions

Libraries used

Dataset_1: International migrant stock By destination and origin

Read the data into R

Remove Regions and keep only Countries

Remove unnecessary columns

Rename Columns

Gather 234 columns to make it tidy

Perform Statistics

Export to csv file

Test that the file is already created.

open the file from local machine

Read from GitHub

Visualization

Dataset_2: UN_Migrant stock by age

Import the dataset

Remove unnecessary columns

Subset the data - Age categories for both sexes

Rename the columns

Gathering columns

Export to csv file

Test that the file is already created.

open the file from local machine

Read from GitHub

Visualization

Dataset_3: UN_Migrant_refugee stock total 2019

Import the dataset

Remove unnecessary columns

Subset the data - Estimate refugee stock including asylum seekers (both sexes)

Rename the columns

Gathering columns

Export to csv file

Test that the file is already created.

open the file from local machine

Read from GitHub

Visualization

Conclusions

Important links