Using tidyr, dplyr, dbplyr, RMySQL, ggplot2, stringr, RCurl

UN Immigration Data

I will be using the data source from Subhalaxmi Rout regarding UN Migrartion Data.

This data set is huge, so I’m only looking at one table, which cooresponds to the year 1990. Once a pipline can be established, I can tidy a few more tables, join them and look at some trends.

SQL Server for Data

There is a lot of data here so it will make sense to store the results of the tidying on SQL server. I have turned on my GCP SQL server (which is password protected) for this. If you would like to reproduce this work from stratch, please ask for the password. The creation of the DB table can be done from RMySQL, which is nice. The code creating the DB is below in markdown since I only needed to run the code once to create the UN_immg DB.

conn <-
    dbConnect(
      RMySQL::MySQL(),
      username = "root",
      password = rstudioapi::askForPassword("Database password"),
      host = '34.68.193.229'
    )
dbSendQuery(conn, "CREATE DATABASE UN_immg;")

Once created the table UN_immg I can use the following connection to the UN_immg DB name.

conn<-
      dbConnect(
      RMySQL::MySQL(),
      username = "root",
      password = params$pwd,
      # For Rmd use: rstudioapi::askForPassword("Database password")
      # For knitting use params$pwd
      host = '34.68.193.229',
      dbname = "UN_immg"
    ) 
dbSendQuery(conn, "USE UN_immg")

## <MySQLResult:NA,0,0>

Load Data

The data comes from a massive excel file with multiple tables. I’ll start with one table from 1990. This is an excerpt of its initial state.

m<-getURL(
"https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/data/un_immg.csv")
df_immg<-read.csv(text=m, header = T, na.strings = "", skip = 14)

Looking at the initial state of the data.

DT::datatable(df_immg[1:5,1:5], 
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE, 
                         paging=TRUE,
                         fixedHeader=TRUE))

I’ll need to rename columns so I am re-using a function I wrote previously to replace the column names with a value found in a given row.

rename.columns<-function(df, row_idx, col_start, name_join=F){
  for (i in col_start:ncol(df)){                                
    prefix<-c(str_remove(names(df[i]), "[.]\\d"), df[row_idx,i])
    join_name<-paste(prefix, collapse="_")  
    if (name_join == FALSE){
      names(df)[names(df) == names(df[i])]<-df[row_idx,i]
    } else {
      names(df)[names(df) == names(df[i])]<-join_name
    }
  }
  return(df)
}

Destination is where people ended up, country of origin is the total number of people who came to the destination column, the rest of the columns outline the numbers from each other country recognized by the UN. These are to columns I will tidy.

df_immg<-df_immg %>%                         
  select(Major.area..region..country.or.area.of.destination, 
         Country.of.origin, 
         X.2:ncol(df_immg)) %>%
  mutate_if(is.factor, as.character) %>%     # change data types
  rename(                                    # replace unruly names
    destination = Major.area..region..country.or.area.of.destination,
    inbound_total = Country.of.origin) %>%
  replace(is.na(.), 0)                       # replacing NA with zero

DT::datatable(df_immg[1:5,1:5], 
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE, 
                         paging=TRUE,
                         fixedHeader=TRUE))

Again, we need to rename the columns based on a row value.

df_immg<-rename.columns(df_immg, 
                        row_idx = 1, 
                        col_start = 3)
df_immg<-df_immg[2:length(df_immg), 1:234]   # drop that last 10 rows (NAs)
df_immg$year = 1990
DT::datatable(df_immg[1:5,1:4], 
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE, 
                         paging=TRUE,
                         fixedHeader=TRUE))

This dataset is almost square, to make it tidy I think we want to pivot this to three columns: to, form, total moved. Once in a tidy format, we can remove whitespace from the numbers and cast them as numerics.

df_immg<-df_immg %>%  
    pivot_longer(                       # gather columns to rows   
                 -c("destination", 
                    "inbound_total",
                    "year"),
                 names_to = "origin",    # naming params
                 values_to = "total", 
                 names_repair = "unique") %>%        
    mutate(total = str_remove_all(total, " ")) %>%   
    mutate(inbound_total = str_remove_all(inbound_total, " ")) %>%
    mutate_at(.vars=vars(total, inbound_total), .funs = as.numeric) 

DT::datatable(df_immg, 
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE, 
                         paging=TRUE,
                         fixedHeader=TRUE,
                         pageLength = 5))

57,536 rows with many repeat values and destinations. Is this really more tidy?

Migration Data Pipeline

Now that our process is in place, I’ll bring in and tidy several other years; 1995, 2000, 2005, 2010, and 2015.

urls<-c(
  "https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/data/un_immg_1995.csv",
  "https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/data/un_immg_2k.csv",
  "https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/data/un_immg_2005.csv",
  "https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/data/un_immg_2010.csv",
  "https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/data/un_immg_2015.csv")
data_years<-c(1995,2000,2005,2010,2015)

for (i in 1:length(data_years)){

  d<-getURL(urls[i])
  df_immg_i<-read.csv(text=d, header = T, na.strings = "", skip = 14)
  
  df_immg_i<-df_immg_i %>%
    select(Major.area..region..country.or.area.of.destination,
           Country.of.origin,
           X.2:ncol(df_immg_i)) %>%
    mutate_if(is.factor, as.character) %>%     # change data types
    rename(                                    # replace unruly names
      destination = Major.area..region..country.or.area.of.destination,
      inbound_total = Country.of.origin) %>%
    replace(is.na(.), 0)                       # replacing NA with zero

  df_immg_i<-rename.columns(df_immg_i,
                          row_idx = 1,
                          col_start = 3)[2:length(df_immg_i), 1:234]

  df_immg_i<-df_immg_i %>%
      pivot_longer(                           # gather columns to rows
                   -c("destination",
                      "inbound_total"),
                   names_to = "origin",       # naming params
                   values_to = "total",
                   names_repair = "unique") %>%
      mutate(total = str_remove_all(total, " ")) %>%
      mutate(inbound_total = str_remove_all(inbound_total, " ")) %>%
      mutate_at(.vars=vars(total, inbound_total), .funs = as.numeric) %>%
      mutate(year = data_years[i])
  df_immg<-rbind(df_immg, df_immg_i)
}

DT::datatable(df_immg, 
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE, 
                         paging=TRUE,
                         fixedHeader=TRUE,
                         pageLength = 5))

That’s one huge dataframe - 334,776 rows. I’ve written it to a SQL Server for faster handling of the data. It is rendered in markdown below because I only need to write the data to UN_immg DB once. Once all the data was written to a database, I can reassign the variable df_immg to the dbplyr function tbl() and all my data wrangling will be sent as SQL queries to the database. Pretty nice feature.

dbWriteTable(conn = conn, name = "UN_immg", value = df_immg)

df_immg<-tbl(conn, "UN_immg")

Analysis

The discussion post talks about looking at migration patterns for the United States. Where are people coming to the US from and where are Americans going.

Who is coming to North America?

First, let’s look at some data integrity and make sure that the US is listed in a consistent format.

df<-df_immg %>%
  filter(origin=="Afghanistan" & destination %in% c(
                                                    "Northern America",
                                                    "United States of America", 
                                                    "Canada", 
                                                    "United States")) %>%
  select(year, origin, destination, total) %>%
  group_by(year, destination) %>%
  summarise(total = sum(total))

DT::datatable(as.data.frame(df),
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE,
                         paging=TRUE,
                         fixedHeader=TRUE,
                         pageLength = 5))

Looks like the US got lumped into Northern America after 2005. At least for Afghanistan - Let’s check with another example.

df<-df_immg %>%
  filter(origin=="Germany" & destination %in% c("Northern America",
                                                "United States of America", 
                                                "Canada", 
                                                "Mexico")) %>%
  select(year, origin, destination, total) %>%
  group_by(year, destination) %>%
  summarise(total = sum(total))

DT::datatable(as.data.frame(df),
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE,
                         paging=TRUE,
                         fixedHeader=TRUE,
                         pageLength = 5))

Ok so this is disappointing. Northern America has become a catch all for Canada and the United States some time after 2000. It’s going to be hard to split this into which exact country each nationality immigrated to if the US and Canada are grouped together.

df<-df_immg %>%
  filter(destination == "Northern America" & total >0) %>%
  select(year, origin, total) %>%
  group_by(year, origin) %>%
  arrange(desc(total))

DT::datatable(as.data.frame(df), 
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE, 
                         paging=TRUE,
                         fixedHeader=TRUE,
                         pageLength = 5))

Mexico dominates the migration in-flows to the US and Canada, as expected. The more recent numbers from China and India are the next closest. The numbers for Mexico do seem to ebb-and-flow though. The migration numbers surged in 2000 and dropped by 2005 only to surge again by 2015.

df_immg %>%
  filter(destination=="Northern America" & origin %in% c("Mexico", 
                                                         "China",
                                                         "India",
                                                         "Philippines")) %>%
  select(year,origin, total) %>%
  ggplot() +
  geom_line(aes(x=year, y=total, group=origin, color=origin))+
  labs(x="Year", y="Total Migration", title="Migration to Northern America")

We see here that there was general decrease in migration between 2000 and 2005 and a return to increases in 2015 for the top origin nations. Clearly, Mexico is far apart from the rest of the field.

Where are Americans going?

The regional groupings obviously take up a lot of the top values, but we can exclude some of them to get the board strokes of where Americans are mirgrating to. Again, people from the US are headed to Mexico, Central America, Latin America, and the Caribbean in large scale.

df_plot<-df_immg %>%
  filter(origin=="United States of America" & 
         destination !="WORLD" & 
         destination !="Developing regions" & 
         destination !="Less developed regions excluding least developed countries" &
         destination !="Developed regions")%>%
  select(year, origin, destination, total) %>%
  group_by(year, destination) %>%
  summarise(total = sum(total)) %>%
  arrange(desc(total)) %>%
  collect() %>%
  slice(1:7)

  ggplot() +
  geom_line(data=df_plot,aes(x=year, y=total, group=destination, color=destination)) +
  labs(x="Year", y="Total Migration", title="Migration from the United States")

We see a similar trend of people leaving the US and Canada as we saw coming into the United States. A little dip between 2000 and 2005, followed by a surge in 2015.

Afghanistan Revisited

In part 2 I initially looked at the childhood mortality of Afghanistan and I would like to circle back to Afghanistan in terms of mirgration. This is especially topical since the US has recently signed a peace agreement with the Taliban. This agreement will likely end our involvement in the country having not defeated the organization that initially brought the US to Afghanistan. Let’s see the mirgrantion pattern.

df_immg %>%
  filter(origin=="Afghanistan" & destination =="WORLD") %>%
  select(year, origin, destination, total) %>%
  group_by(year, destination) %>%
  summarise(total = sum(total)) %>%
  arrange(desc(total)) %>%
  collect() %>%
  slice(1:7) %>%
  ggplot() +
  geom_line(aes(x=year, y=total, group=destination, color=destination)) +
  labs(x="Year", y="Total Mirgration", title="Mirgration from Afghanistan")

The end of the Soviet incursion into Afghanistan led to huge numbers of people leaving the country in 1990. This migration trend until the rise of the Taliban in the mid-1990s. We again see a drop in migration after the US invaded the country and another inflection upwards after the US’s involvement in Afghanistan had become a stalemate.

Tidy Data - Project 2 Part 3 - UN Migration Data

Jeff Shamp

2020-03-08