UN Migration Data

library(knitr)
df <- read.csv('migration.csv', skip = 14, as.is = TRUE, header = FALSE)
my_colnames <- df[1,1:5]
country_colnames <- df[2, 6:ncol(df)]
df <- df[-1:-2,]
names(df) <- c(my_colnames, country_colnames)
df[152, 2] <- 'United Kingdom'
remove <- df[c(1:8, 29, 39, 46, 47, 53, 71, 72, 78, 80:81, 98, 108, 127:128, 139, 153, 170, 180:181, 208, 217, 232, 238:239), 2]
(head(df[,1:8], 10))
##    Sort\norder         Major area, region, country or area of destination Notes
## 3            1                                                      WORLD      
## 4            2                                          Developed regions   (b)
## 5            3                                         Developing regions   (c)
## 6            4                                  Least developed countries   (d)
## 7            5 Less developed regions excluding least developed countries      
## 8            6                                         Sub-Saharan Africa   (e)
## 9            7                                                     Africa      
## 10           8                                             Eastern Africa      
## 11           9                                                    Burundi      
## 12          10                                                    Comoros      
##    Country code Type of data (a)        Total Other North Other South
## 3           900                   243 700 236   2 139 539   7 644 005
## 4           901                   140 481 955     539 780   3 520 214
## 5           902                   103 218 281   1 599 759   4 123 791
## 6           941                    11 951 316     241 805   1 005 567
## 7           934                    91 266 965   1 357 954   3 118 224
## 8           947                    18 993 986     328 171   1 551 645
## 9           903                    20 649 557     350 543   1 586 186
## 10          910                     6 129 113     102 542     457 137
## 11          108              B R      286 810       3 242      20 599
## 12          174                B       12 555         597         754

The main issue with this dataset is that the country of origin for migration is split into 235 columns. We will need to gather these. There is also an issue with the values having spaces in them making them hard to calculate over. First I shall gather the columns. I will also remove some rows that deal with aggregate regions as we are going to be analyzing movement from the US to other specific countries, and as such having aggregated regions will not be useful to our analysis.

library(tidyverse)
library(stringr)
df <- df %>% filter(!(`Major area, region, country or area of destination` %in% remove))
df <- pivot_longer(df, 6:ncol(df), names_to = 'country_of_origin', values_to = 'count')
head(df, 10)
## # A tibble: 10 x 7
##    `Sort\norder` `Major area, re~ Notes `Country code` `Type of data (~
##    <chr>         <chr>            <chr> <chr>          <chr>           
##  1 9             Burundi          ""    108            B R             
##  2 9             Burundi          ""    108            B R             
##  3 9             Burundi          ""    108            B R             
##  4 9             Burundi          ""    108            B R             
##  5 9             Burundi          ""    108            B R             
##  6 9             Burundi          ""    108            B R             
##  7 9             Burundi          ""    108            B R             
##  8 9             Burundi          ""    108            B R             
##  9 9             Burundi          ""    108            B R             
## 10 9             Burundi          ""    108            B R             
## # ... with 2 more variables: country_of_origin <chr>, count <chr>

Lets quickly tidy the count column so that it is usable for analysis.

df$count <- as.integer(str_replace_all(df$count, ' ', ''))
head(df, 10)
## # A tibble: 10 x 7
##    `Sort\norder` `Major area, re~ Notes `Country code` `Type of data (~
##    <chr>         <chr>            <chr> <chr>          <chr>           
##  1 9             Burundi          ""    108            B R             
##  2 9             Burundi          ""    108            B R             
##  3 9             Burundi          ""    108            B R             
##  4 9             Burundi          ""    108            B R             
##  5 9             Burundi          ""    108            B R             
##  6 9             Burundi          ""    108            B R             
##  7 9             Burundi          ""    108            B R             
##  8 9             Burundi          ""    108            B R             
##  9 9             Burundi          ""    108            B R             
## 10 9             Burundi          ""    108            B R             
## # ... with 2 more variables: country_of_origin <chr>, count <int>

The original poster indicated that some of the columns may not be useful to this analysis, so I will remove the columns that seem unnecessary.

df <- df[,c(-1, -3, -5)]
head(df, 10)
## # A tibble: 10 x 4
##    `Major area, region, country or area ~ `Country code` country_of_orig~  count
##    <chr>                                  <chr>          <chr>             <int>
##  1 Burundi                                108            Total            286810
##  2 Burundi                                108            Other North        3242
##  3 Burundi                                108            Other South       20599
##  4 Burundi                                108            Afghanistan          NA
##  5 Burundi                                108            Albania              NA
##  6 Burundi                                108            Algeria              NA
##  7 Burundi                                108            American Samoa       NA
##  8 Burundi                                108            Andorra              NA
##  9 Burundi                                108            Angola               NA
## 10 Burundi                                108            Anguilla             NA

Let us focus in on the United States and see how many immigrants are entering the US.

names(df)[1:2] <- c('country_of_destination', 'country_code')
us_dest <- df %>% filter(country_of_destination == 'United States of America')
sum(us_dest$count, na.rm = TRUE)
## [1] 93254204
top10 <- us_dest %>% arrange(desc(count)) %>% .[2:11,]
top10 %>% ggplot(aes(reorder(country_of_origin, desc(count)), count, fill = country_of_origin)) +
          geom_bar(stat = 'identity') + xlab('Origin Country') + theme(axis.text.x = element_text(angle = 90))+
          ylab('Number of Immigrants') + ggtitle('Top 10 countries by Immigrants to US')

We see that somewhat unsurprisingly Mexico provides a huge proportion of the immigrants into the United States. The total number of immigrants is approximately 93 million, so Mexico makes up more than 1/8 of the total immigrant population. Now let us flip the script and see where US emmigrants are going to.

us_orig <- df %>% filter(country_of_origin == 'United States of America')
sum(us_orig$count, na.rm = TRUE)
## [1] 3107839
top10 <- us_orig %>% arrange(desc(count)) %>% .[c(1:10),]
top10 %>% ggplot(aes(reorder(country_of_destination, desc(count)), count, fill = country_of_destination)) +
          geom_bar(stat = 'identity') + xlab('Destination Country') + 
          theme(axis.text.x = element_text(angle = 90)) +
          ylab('Number of Immigrants') + ggtitle('Top 10 countries by Emmigrants from the US')

Again, Mexico receives the highest number of US emmigrants, although the number is much smaller than the number of immigrants from Mexico. However, the total number of emmigrants is only 3 million, so the proportion of Americans emmigrating to Mexico is almost 1/4 of all total emmigrants. One last analysis I would like to run is to find which countries had the highest difference between immigrants and emmigrants from the US.

diff <- inner_join(us_dest, us_orig, by = c('country_of_origin' = 'country_of_destination'))
diffobj <- diff %>% group_by(country_of_origin) %>% summarize(diff = sum(c(count.x, -count.y), na.rm = TRUE))
top10 <- diffobj %>% arrange(desc(diff)) %>% .[1:10,]
bot10 <- diffobj %>% arrange(diff) %>% .[1:10,]
top10 %>% ggplot(aes(reorder(country_of_origin, desc(diff)), diff, fill = country_of_origin)) +
          geom_bar(stat = 'identity') + xlab('Country') + 
          theme(axis.text.x = element_text(angle = 90)) +
          ylab('Difference in Immigrants and Emmigrants') + 
          ggtitle('Top 10 countries by Difference in Immigration')

bot10 %>% ggplot(aes(reorder(country_of_origin, diff), diff, fill = country_of_origin)) +
          geom_bar(stat = 'identity') + xlab('Country') + 
          theme(axis.text.x = element_text(angle = 90)) +
          ylab('Difference in Immigrants and Emmigrants') + 
          ggtitle('Bottom 10 countries by Difference in Immigration')

Once again somewhat unsuprisingly, we find that Mexico has a large difference in the number of immigrants to the US compared to the number of emmigrants from the US. However, we find some other interesting insights such as the fact that more Americans emmigrate to Australia than immigrate from Australia, and America has high net immigration from many populous Asian countries such as China and India.