title: “Data 607 - Project 2a”

author: “Sufian”

date: “9/30/2019”

output: html_document

Rpub link:

http://rpubs.com/ssufian/535540

Source: United Nations Population Division, Department of Economic and Social Affairs: UN International Migrant Stock

Jai’s list of questions from his Discussion 5 post:

NOTE: Migrant data refers to immigration into the area of destination (area, region, or country).

Jai posed the questions below; the summaries could be done by gender at any of these aggregation levels:

• Gender ratios of migrant stock for each region of the world, for each income group, etc.

• Average gender ratios of world migrant stock.

• What is the variance across countries?

• Is there a trend across years for any of these sequences?

This project will focus only on the following 2 regions, which straddle high vs. low income:


Loading libraries
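
The library chunk itself is not shown in the rendered output. Judging from the functions called further down (`%>%`, `gather`, `separate`, `spread`, `str_replace_all`, `ggplot`), a setup along the following lines is assumed. The exact set of packages the author loaded is not confirmed by the source.

```r
# Assumed setup: inferred from the function calls used in this report,
# not copied from the original (hidden) chunk.
library(dplyr)     # %>%, mutate, filter, select, group_by
library(tidyr)     # gather, separate, spread
library(stringr)   # str_replace_all
library(magrittr)  # only needed if the compound pipe %<>% is used
library(ggplot2)   # ggplot, geom_histogram, facet_grid
```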

Loading Data from United Nations Reports

url <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/UN_MigrantStockTotal_2019%20(1).csv'

# Read the data: skip the first 13 lines of the file and treat it as headerless
df <- read.csv(file = url, sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = FALSE, skip = 13, header = FALSE)

head(df)
##   V1                                                          V2   V3  V4
## 1  1                                                       WORLD <NA> 900
## 2  2                                       UN development groups <NA>  NA
## 3  3                                      More developed regions    b 901
## 4  4                                      Less developed regions    c 902
## 5  5                                   Least developed countries    d 941
## 6  6 Less developed regions, excluding least developed countries <NA> 934
##     V5          V6          V7          V8          V9         V10
## 1 <NA> 153,011,473 161,316,895 173,588,441 191,615,574 220,781,909
## 2 <NA>          ..          ..          ..          ..          ..
## 3 <NA>  82,767,216  92,935,095 103,961,989 116,687,616 130,613,460
## 4 <NA>  70,244,257  68,381,800  69,626,452  74,927,958  90,168,449
## 5 <NA>  11,060,221  11,681,777  10,063,948   9,833,150  10,432,671
## 6 <NA>  59,184,036  56,700,023  59,562,504  65,094,808  79,735,778
##           V11         V12        V13        V14        V15        V16
## 1 248,861,296 271,642,105 77,661,689 81,686,116 88,029,221 97,860,838
## 2          ..          ..         ..         ..         ..         ..
## 3 140,643,317 152,069,261 40,426,798 45,377,588 50,801,898 57,078,401
## 4 108,217,979 119,572,844 37,234,891 36,308,528 37,227,323 40,782,437
## 5  13,631,349  16,289,023  5,550,233  5,824,077  5,033,932  4,987,537
## 6  94,586,630 103,283,821 31,684,658 30,484,451 32,193,391 35,794,900
##           V17         V18         V19        V20        V21        V22
## 1 114,061,680 128,863,389 141,488,004 75,349,784 79,630,779 85,559,220
## 2          ..          ..          ..         ..         ..         ..
## 3  63,408,858  67,824,389  73,765,353 42,340,418 47,557,507 53,160,091
## 4  50,652,822  61,039,000  67,722,651 33,009,366 32,073,272 32,399,129
## 5   5,185,496   6,784,461   8,086,158  5,509,988  5,857,700  5,030,016
## 6  45,467,326  54,254,539  59,636,493 27,499,378 26,215,572 27,369,113
##          V23         V24         V25         V26
## 1 93,754,736 106,720,229 119,997,907 130,154,101
## 2         ..          ..          ..          ..
## 3 59,609,215  67,204,602  72,818,928  78,303,908
## 4 34,145,521  39,515,627  47,178,979  51,850,193
## 5  4,845,613   5,247,175   6,846,888   8,202,865
## 6 29,299,908  34,268,452  40,332,091  43,647,328
#labeling columns 
new_name <- c("Sort","Region","Notes","Code","Data_type","1990.Total","1995.Total","2000.Total","2005.Total","2010.Total","2015.Total","2019.Total","1990.Male","1995.Male","2000.Male","2005.Male","2010.Male","2015.Male","2019.Male","1990.Female","1995.Female","2000.Female","2005.Female","2010.Female","2015.Female","2019.Female")


# Rename columns: new_name has 26 entries that map positionally onto V1..V26
names(df) <- new_name
head(df)
##   Sort                                                      Region Notes
## 1    1                                                       WORLD  <NA>
## 2    2                                       UN development groups  <NA>
## 3    3                                      More developed regions     b
## 4    4                                      Less developed regions     c
## 5    5                                   Least developed countries     d
## 6    6 Less developed regions, excluding least developed countries  <NA>
##   Code Data_type  1990.Total  1995.Total  2000.Total  2005.Total
## 1  900      <NA> 153,011,473 161,316,895 173,588,441 191,615,574
## 2   NA      <NA>          ..          ..          ..          ..
## 3  901      <NA>  82,767,216  92,935,095 103,961,989 116,687,616
## 4  902      <NA>  70,244,257  68,381,800  69,626,452  74,927,958
## 5  941      <NA>  11,060,221  11,681,777  10,063,948   9,833,150
## 6  934      <NA>  59,184,036  56,700,023  59,562,504  65,094,808
##    2010.Total  2015.Total  2019.Total  1990.Male  1995.Male  2000.Male
## 1 220,781,909 248,861,296 271,642,105 77,661,689 81,686,116 88,029,221
## 2          ..          ..          ..         ..         ..         ..
## 3 130,613,460 140,643,317 152,069,261 40,426,798 45,377,588 50,801,898
## 4  90,168,449 108,217,979 119,572,844 37,234,891 36,308,528 37,227,323
## 5  10,432,671  13,631,349  16,289,023  5,550,233  5,824,077  5,033,932
## 6  79,735,778  94,586,630 103,283,821 31,684,658 30,484,451 32,193,391
##    2005.Male   2010.Male   2015.Male   2019.Male 1990.Female 1995.Female
## 1 97,860,838 114,061,680 128,863,389 141,488,004  75,349,784  79,630,779
## 2         ..          ..          ..          ..          ..          ..
## 3 57,078,401  63,408,858  67,824,389  73,765,353  42,340,418  47,557,507
## 4 40,782,437  50,652,822  61,039,000  67,722,651  33,009,366  32,073,272
## 5  4,987,537   5,185,496   6,784,461   8,086,158   5,509,988   5,857,700
## 6 35,794,900  45,467,326  54,254,539  59,636,493  27,499,378  26,215,572
##   2000.Female 2005.Female 2010.Female 2015.Female 2019.Female
## 1  85,559,220  93,754,736 106,720,229 119,997,907 130,154,101
## 2          ..          ..          ..          ..          ..
## 3  53,160,091  59,609,215  67,204,602  72,818,928  78,303,908
## 4  32,399,129  34,145,521  39,515,627  47,178,979  51,850,193
## 5   5,030,016   4,845,613   5,247,175   6,846,888   8,202,865
## 6  27,369,113  29,299,908  34,268,452  40,332,091  43,647,328

Data Munging using dplyr & tidyr

df1 <- df

df1[df1==".."] <- "0"

# making dataset long format
df1 <- gather(df1,"year_types","n_years",6:26)
head(df1)
##   Sort                                                      Region Notes
## 1    1                                                       WORLD  <NA>
## 2    2                                       UN development groups  <NA>
## 3    3                                      More developed regions     b
## 4    4                                      Less developed regions     c
## 5    5                                   Least developed countries     d
## 6    6 Less developed regions, excluding least developed countries  <NA>
##   Code Data_type year_types     n_years
## 1  900      <NA> 1990.Total 153,011,473
## 2   NA      <NA> 1990.Total           0
## 3  901      <NA> 1990.Total  82,767,216
## 4  902      <NA> 1990.Total  70,244,257
## 5  941      <NA> 1990.Total  11,060,221
## 6  934      <NA> 1990.Total  59,184,036
# Strip the thousands separators, then convert the counts to numeric
df2 <- df1 %>% 
  mutate(n_years = as.numeric(str_replace_all(n_years, ",", "")))

head(df2)
##   Sort                                                      Region Notes
## 1    1                                                       WORLD  <NA>
## 2    2                                       UN development groups  <NA>
## 3    3                                      More developed regions     b
## 4    4                                      Less developed regions     c
## 5    5                                   Least developed countries     d
## 6    6 Less developed regions, excluding least developed countries  <NA>
##   Code Data_type year_types   n_years
## 1  900      <NA> 1990.Total 153011473
## 2   NA      <NA> 1990.Total         0
## 3  901      <NA> 1990.Total  82767216
## 4  902      <NA> 1990.Total  70244257
## 5  941      <NA> 1990.Total  11060221
## 6  934      <NA> 1990.Total  59184036
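
The conversion above relies on two conventions in the UN file: thousands are separated by commas, and unavailable cells are written as `..`. A minimal base-R sketch of the same cleaning follows. The helper `parse_count` is hypothetical (it is not part of the original code), and unlike the approach above it keeps unavailable cells as `NA` rather than 0, so they cannot be mistaken for true zeros.

```r
# Hypothetical helper, for illustration only.
parse_count <- function(x) {
  x <- trimws(x)
  x[x %in% ".."] <- NA_character_              # UN marker for "not available"
  as.numeric(gsub(",", "", x, fixed = TRUE))   # drop thousands separators
}

raw <- c("153,011,473", "..", "82,767,216")
parse_count(raw)
```

With this variant, a later filter on zero counts would instead test for missing values, for example `filter(!is.na(Female))`.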
# Split year_types (e.g. "1990.Total") into separate Year and gender columns

separate_DF <- df2 %>% separate(year_types, c("Year", "gender"))
head(separate_DF)
##   Sort                                                      Region Notes
## 1    1                                                       WORLD  <NA>
## 2    2                                       UN development groups  <NA>
## 3    3                                      More developed regions     b
## 4    4                                      Less developed regions     c
## 5    5                                   Least developed countries     d
## 6    6 Less developed regions, excluding least developed countries  <NA>
##   Code Data_type Year gender   n_years
## 1  900      <NA> 1990  Total 153011473
## 2   NA      <NA> 1990  Total         0
## 3  901      <NA> 1990  Total  82767216
## 4  902      <NA> 1990  Total  70244257
## 5  941      <NA> 1990  Total  11060221
## 6  934      <NA> 1990  Total  59184036
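
The reshaping and the name splitting can also be done in a single call. As a sketch, assuming tidyr 1.0.0 or later is installed (the original analysis does not use it), `pivot_longer()` can split column names such as `1990.Total` into `Year` and `gender` directly. The data frame `toy` is illustrative; its values are copied from the first rows of the output above.

```r
library(tidyr)

# Toy subset: two regions, 1990 columns only (values from the UN table above).
toy <- data.frame(
  Region        = c("WORLD", "More developed regions"),
  `1990.Total`  = c(153011473, 82767216),
  `1990.Male`   = c(77661689, 40426798),
  `1990.Female` = c(75349784, 42340418),
  check.names   = FALSE
)

# Go long and split the column names on the dot in one step.
long <- pivot_longer(
  toy,
  cols      = -Region,
  names_to  = c("Year", "gender"),
  names_sep = "\\.",
  values_to = "n_years"
)

# Back to one column per gender (Female, Male, Total).
wide <- pivot_wider(long, names_from = gender, values_from = n_years)
wide
```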
wide_DF <- separate_DF%>% spread(gender, n_years)
head(wide_DF)               
##   Sort                                                      Region Notes
## 1    1                                                       WORLD  <NA>
## 2    2                                       UN development groups  <NA>
## 3    3                                      More developed regions     b
## 4    4                                      Less developed regions     c
## 5    5                                   Least developed countries     d
## 6    6 Less developed regions, excluding least developed countries  <NA>
##   Code Data_type Year   Female     Male     Total
## 1  900      <NA> 1990 75349784 77661689 153011473
## 2   NA      <NA> 1990        0        0         0
## 3  901      <NA> 1990 42340418 40426798  82767216
## 4  902      <NA> 1990 33009366 37234891  70244257
## 5  941      <NA> 1990  5509988  5550233  11060221
## 6  934      <NA> 1990 27499378 31684658  59184036
no_zero_DF_wide <- wide_DF %>% filter(Female != 0)

# Drop the unneeded columns and store Year as an integer

no_zero_DF_wide <- no_zero_DF_wide %>% select(-c(Notes, Code, Data_type)) %>% mutate(Year = as.integer(Year))
              
head(no_zero_DF_wide)
##   Sort                                                      Region Year
## 1    1                                                       WORLD 1990
## 2    3                                      More developed regions 1990
## 3    4                                      Less developed regions 1990
## 4    5                                   Least developed countries 1990
## 5    6 Less developed regions, excluding least developed countries 1990
## 6    8                                       High-income countries 1990
##     Female     Male     Total
## 1 75349784 77661689 153011473
## 2 42340418 40426798  82767216
## 3 33009366 37234891  70244257
## 4  5509988  5550233  11060221
## 5 27499378 31684658  59184036
## 6 37812794 39990074  77802868
no_zero_DF1 <- gather(no_zero_DF_wide, "gender","N_years",4:6)
                     
head(no_zero_DF1)                   
##   Sort                                                      Region Year
## 1    1                                                       WORLD 1990
## 2    3                                      More developed regions 1990
## 3    4                                      Less developed regions 1990
## 4    5                                   Least developed countries 1990
## 5    6 Less developed regions, excluding least developed countries 1990
## 6    8                                       High-income countries 1990
##   gender  N_years
## 1 Female 75349784
## 2 Female 42340418
## 3 Female 33009366
## 4 Female  5509988
## 5 Female 27499378
## 6 Female 37812794

World migratory Variations over time

# Compare the variation in world migration across ALL regions over time

world_migration_wide <- no_zero_DF_wide %>% 
                   group_by(Region, Year) %>% 
                   mutate(femalepct = Female/Total) %>% 
                   mutate(malepct = Male/Total) 
                  

head(world_migration_wide)
## # A tibble: 6 x 8
## # Groups:   Region, Year [6]
##    Sort Region                 Year  Female   Male  Total femalepct malepct
##   <int> <chr>                 <int>   <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
## 1     1 WORLD                  1990  7.53e7 7.77e7 1.53e8     0.492   0.508
## 2     3 More developed regio~  1990  4.23e7 4.04e7 8.28e7     0.512   0.488
## 3     4 Less developed regio~  1990  3.30e7 3.72e7 7.02e7     0.470   0.530
## 4     5 Least developed coun~  1990  5.51e6 5.55e6 1.11e7     0.498   0.502
## 5     6 Less developed regio~  1990  2.75e7 3.17e7 5.92e7     0.465   0.535
## 6     8 High-income countries  1990  3.78e7 4.00e7 7.78e7     0.486   0.514
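
As a quick check on the shares in the first row, the WORLD figures for 1990 can be worked through by hand. This small sketch uses only numbers printed in the tables above; the variable names are illustrative.

```r
# WORLD, 1990 (from the wide table above)
female <- 75349784
male   <- 77661689
total  <- 153011473

stopifnot(female + male == total)   # the two genders add up to the total

femalepct <- female / total
malepct   <- male / total
round(c(femalepct = femalepct, malepct = malepct), 3)
# femalepct 0.492, malepct 0.508, matching the first row of the tibble
```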
dfhist <- world_migration_wide %>% 
         group_by(Year)

# Histograms of the gender shares, one panel per year
pf <- ggplot(dfhist, aes(x=femalepct, color=Year)) +
  geom_histogram(fill="red", alpha=0.5, position="identity")+facet_grid(Year ~ .)
pf
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pm <- ggplot(dfhist, aes(x=malepct, color=Year)) +
  geom_histogram(fill="yellow", alpha=0.5, position="identity")+facet_grid(Year ~ .)
pm
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Observation 3b:

The histograms suggest that the female and male shares of the total world migrant stock were each roughly bell-shaped over time. The main difference was that the female distribution was left-skewed while the male distribution was right-skewed, a mirror image that is expected when the two shares sum to one. Their central tendencies were very similar as well. The right skew for men is also consistent with the earlier box plots of the high-income and more developed region groupings, which showed that men had a higher spread than women because of longer positive tails.
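
Histograms alone give only a visual impression of shape. One way to back this reading with numbers, sketched here as an assumption rather than as part of the original analysis, is a moment-based sample skewness together with a Shapiro-Wilk test. The helper `skewness` is hypothetical (base R has no built-in skewness function), and the six values are the 1990 `femalepct` entries printed by `head()` earlier, used only to illustrate the calls.

```r
# Hypothetical helper: moment-based sample skewness.
# Negative values indicate a left skew, positive values a right skew.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Illustration only: the first six 1990 female shares shown earlier.
femalepct_1990 <- c(0.492, 0.512, 0.470, 0.498, 0.465, 0.486)

skewness(femalepct_1990)
shapiro.test(femalepct_1990)   # H0: the sample comes from a normal distribution
```

On the full data the same helper could be applied per year, for example with `group_by(Year) %>% summarise(skew = skewness(femalepct), variance = var(femalepct))`.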


Summary Page

This short study showed some interesting differences in migratory behavior between men and women over the period 1990 to 2019. In general, men seemed to be more mobile and were able to move into higher-income countries. Men were also able to move more into the “less developed regions”. Paradoxically, however, the trends showed women outpacing men in the “developed regions” and also doing better in the low-income countries. These two sets of findings seem to contradict each other, which deserves further investigation and analysis. The next step of this study was therefore to separate the traditionally rich nations from the poorer nations. I randomly picked 7 poor nations from each hemisphere and compared them to the G7 countries:

The extra step showed the following:

  • Once countries were unpacked, women were clearly more mobile than men in the richer (G7) countries, at 51% vs. 49%. Curiously, this statistic was exactly the opposite in the poorer countries, where men had the slight advantage.

  • Within the poor countries, Venezuela, Syria and Mexico had the most migration flow activity, most probably stemming from their internal socio-economic issues.