library("tidyverse")
library("ggplot2")
require(maps)

Introduction

Dataset 1 UN Migration, Courtesy of Salma Elshahawy

For this project we are required to use 3 wide datasets to practice tidying,cleaning and analysis of data. We have been asked to creat csv files from the datasets and import into R. Then we are required to use tidyr and dplyr to clean and analyse the data.

This dataset is taken from the UN site at https://www.un.org/en/development/desa/population/migration/data/estimates2/estimates15.asp.

The dataset contains estimated of international migration by age, sex and origin.We worked on the dataset that contains data by destination and origin and tried to answer the following questions from Salma:

  1. How many migrants the US had from 1990 to 2019 and from which countries? What were the top ten countries migrants came from?
  2. which year had the most immigrant population in the US

Load the dataset

The data has a lot of empty fields, so it required converting the empties to NAs while importing so that cleaning would be easier using tidyr

migrant <- read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/UN_Migrant.csv", sep = ",", header = FALSE, stringsAsFactors=FALSE,  na.strings = c("", "NA")) #import converting all empties to NA

Tidy the dataset

There is a lot of cleaning needed with this one. The steps are outlined below:

  1. The dataset had a few blank lines before the headers, so we did not import the header. We see that this dataset is very wide with the names of countries in the columns. Row no 16 has all the country names so for now we are using that row as header
  2. Removed the columns 2, 4:9 that we do not require as we will do a country level analysis and not a regional analysis
  3. Fill the missing headers year and destination country
  4. Drop the NA values using the dplyr drop_na function to remove NAs in the year column
  5. There are a lot of blank columns after the last field Zimbabwe so we only select the columns with values
  6. The header rwo is now duplicated so we remove the first row
colnames(migrant) <- migrant[16, ] #Need to use the country names in row 16 as headers automatically, it is too much to do manually
migrant<-migrant[,-(4:9)] #removing columns we won't need
migrant<-migrant[,-2]
colnames(migrant)[1:2] <- c("year","destination") #headers for the missing fields
migrant<-drop_na(migrant,year) #drop all rows with NA in the year column as those are just blank rows, rows 1 to 15 had information about the dataset and needed to be dropped.
migrant<-migrant%>%
  select_if(~sum(!is.na(.)) > 0) #there were many extra blank columns after Zimbabwe that needed to be removed
migrant <- migrant[-1 ,] #remove duplicate column for country names
head(migrant,2) #check the first 2 rows
##    year           destination Afghanistan Albania Algeria American Samoa
## 17 1990                 WORLD   6,823,350 180,284 921,727          2,041
## 18 1990 UN development groups          ..      ..      ..             ..
##    Andorra  Angola Anguilla Antigua and Barbuda Argentina Armenia  Aruba
## 17   3,792 824,942    2,047              21,753   430,169 899,649 10,596
## 18      ..      ..       ..                  ..        ..      ..     ..
##    Australia Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados
## 17   303,696 506,088  1,634,081  25,182  12,820  5,451,831   84,931
## 18        ..      ..         ..      ..      ..         ..       ..
##      Belarus Belgium Belize   Benin Bermuda Bhutan
## 17 1,767,606 365,360 36,117 234,314  71,702 28,465
## 18        ..      ..     ..      ..      ..     ..
##    Bolivia (Plurinational State of) Bonaire, Sint Eustatius and Saba
## 17                          224,693                            3,206
## 18                               ..                               ..
##    Bosnia and Herzegovina Botswana  Brazil British Virgin Islands
## 17                861,766   26,053 500,392                  3,094
## 18                     ..       ..      ..                     ..
##    Brunei Darussalam Bulgaria Burkina Faso Burundi Cabo Verde Cambodia
## 17            26,323  613,093    1,021,332 337,118     91,368  355,430
## 18                ..       ..           ..      ..         ..       ..
##    Cameroon  Canada Cayman Islands Central African Republic    Chad
## 17  115,853 998,163            373                   46,362 336,802
## 18       ..      ..             ..                       ..      ..
##    Channel Islands   Chile     China China, Hong Kong SAR China, Macao SAR
## 17          18,726 493,026 4,231,648              551,080           95,648
## 18              ..      ..        ..                   ..               ..
##     Colombia Comoros  Congo Cook Islands Costa Rica Côte d'Ivoire Croatia
## 17 1,009,935  40,083 96,372       17,488     69,711       366,348 425,807
## 18        ..      ..     ..           ..         ..            ..      ..
##       Cuba Curaçao  Cyprus Czechia Dem. People's Republic of Korea
## 17 835,796  43,190 174,378 277,260                          39,784
## 18      ..      ..      ..      ..                              ..
##    Democratic Republic of the Congo Denmark Djibouti Dominica
## 17                          436,526 201,761    5,308   42,437
## 18                               ..      ..       ..       ..
##    Dominican Republic Ecuador     Egypt El Salvador Equatorial Guinea
## 17            466,216 214,008 1,322,178   1,242,075            36,178
## 18                 ..      ..        ..          ..                ..
##    Eritrea Estonia Eswatini  Ethiopia Falkland Islands (Malvinas)
## 17 170,603 113,905   35,181 1,689,955                         260
## 18      ..      ..       ..        ..                          ..
##    Faroe Islands   Fiji Finland    France French Guiana French Polynesia
## 17         7,520 90,166 250,765 1,215,895         2,844            3,149
## 18            ..     ..      ..        ..            ..               ..
##     Gabon Gambia Georgia   Germany   Ghana Gibraltar    Greece Greenland
## 17 15,352 36,280 919,454 2,929,448 371,162    11,994 1,022,459     9,510
## 18     ..     ..      ..        ..      ..        ..        ..        ..
##    Grenada Guadeloupe  Guam Guatemala  Guinea Guinea-Bissau  Guyana
## 17  43,249      5,828 1,376   348,332 352,763        55,409 233,731
## 18      ..         ..    ..        ..      ..            ..      ..
##      Haiti Holy See Honduras Hungary Iceland     India Indonesia
## 17 528,873       31  156,594 387,514  17,621 6,623,177 1,638,365
## 18      ..       ..       ..      ..      ..        ..        ..
##    Iran (Islamic Republic of)      Iraq Ireland Isle of Man  Israel
## 17                    631,339 1,506,702 917,286      10,735 281,597
## 18                         ..        ..      ..          ..      ..
##        Italy Jamaica   Japan  Jordan Kazakhstan   Kenya Kiribati Kuwait
## 17 3,351,006 589,010 608,921 313,997  2,971,639 250,340    4,053 81,611
## 18        ..      ..      ..      ..         ..      ..       ..     ..
##    Kyrgyzstan Lao People's Democratic Republic  Latvia Lebanon Lesotho
## 17    522,578                          483,021 215,134 509,323 191,339
## 18         ..                               ..      ..      ..      ..
##    Liberia  Libya Liechtenstein Lithuania Luxembourg Madagascar  Malawi
## 17 516,886 76,256         3,428   341,050     36,141     59,424 143,437
## 18      ..     ..            ..        ..         ..         ..      ..
##    Malaysia Maldives    Mali   Malta Marshall Islands Martinique
## 17  562,762    2,193 647,436 110,746            1,426     11,041
## 18       ..       ..      ..      ..               ..         ..
##    Mauritania Mauritius Mayotte    Mexico Micronesia (Fed. States of)
## 17    134,488   110,708   1,835 4,395,365                       7,714
## 18         ..        ..      ..        ..                          ..
##    Monaco Mongolia Montenegro Montserrat   Morocco Mozambique Myanmar
## 17  4,479   24,466     77,384      7,188 1,748,251  2,222,369 685,310
## 18     ..       ..         ..         ..        ..         ..      ..
##    Namibia Nauru   Nepal Netherlands New Caledonia New Zealand Nicaragua
## 17  16,079 1,419 748,060     723,638         4,151     388,173   442,126
## 18      ..    ..      ..          ..            ..          ..        ..
##      Niger Nigeria  Niue North Macedonia Northern Mariana Islands  Norway
## 17 149,779 446,806 5,860         432,296                    2,525 138,536
## 18      ..      ..    ..              ..                       ..      ..
##      Oman  Pakistan Palau  Panama Papua New Guinea Paraguay    Peru
## 17 12,535 3,343,328 2,958 134,743            3,111  297,979 314,854
## 18     ..        ..    ..      ..               ..       ..      ..
##    Philippines    Poland  Portugal Puerto Rico  Qatar Republic of Korea
## 17   2,033,684 1,510,415 1,873,457   1,200,821 12,204         1,624,729
## 18          ..        ..        ..          ..     ..                ..
##    Republic of Moldova Réunion Romania Russian Federation  Rwanda
## 17             625,683   3,087 813,066         12,662,893 547,718
## 18                  ..      ..      ..                 ..      ..
##    Saint Helena Saint Kitts and Nevis Saint Lucia
## 17          884                20,714      22,005
## 18           ..                    ..          ..
##    Saint Pierre and Miquelon Saint Vincent and the Grenadines  Samoa
## 17                       485                           37,049 74,861
## 18                        ..                               ..     ..
##    San Marino Sao Tome and Principe Saudi Arabia Senegal  Serbia
## 17      1,419                13,977      107,166 370,263 742,547
## 18         ..                    ..           ..      ..      ..
##    Seychelles Sierra Leone Singapore Sint Maarten (Dutch part) Slovakia
## 17     35,633       61,854   156,468                    14,823  133,006
## 18         ..           ..        ..                        ..       ..
##    Slovenia Solomon Islands Somalia South Africa South Sudan     Spain
## 17   91,496           2,212 848,067      308,303     514,943 1,439,019
## 18       ..              ..      ..           ..          ..        ..
##    Sri Lanka State of Palestine   Sudan Suriname  Sweden Switzerland
## 17   885,951          1,813,063 584,940  179,870 206,848     326,276
## 18        ..                 ..      ..       ..      ..          ..
##    Syrian Arab Republic Tajikistan Thailand Timor-Leste    Togo Tokelau
## 17              621,881    537,701  311,308      11,261 193,830   1,684
## 18                   ..         ..       ..          ..      ..      ..
##     Tonga Trinidad and Tobago Tunisia    Turkey Turkmenistan
## 17 32,666             197,521 465,576 2,640,033      259,987
## 18     ..                  ..      ..        ..           ..
##    Turks and Caicos Islands Tuvalu  Uganda   Ukraine United Arab Emirates
## 17                    2,311  2,350 311,602 5,545,760               79,545
## 18                       ..     ..      ..        ..                   ..
##    United Kingdom United Republic of Tanzania United States of America
## 17      3,794,333                     203,070                1,739,233
## 18             ..                          ..                       ..
##    United States Virgin Islands Uruguay Uzbekistan Vanuatu
## 17                        2,362 237,486  1,428,020   5,060
## 18                           ..      ..         ..      ..
##    Venezuela (Bolivarian Republic of)  Viet Nam Wallis and Futuna Islands
## 17                            185,946 1,237,873                     6,484
## 18                                 ..        ..                        ..
##    Western Sahara   Yemen Zambia Zimbabwe
## 17        168,239 455,492 85,203  204,365
## 18             ..      ..     ..       ..

We are only going to do out analysis on the migrants whose destination is the US, so we filter US using subset. We then remove the US column which is now redundant as this dataset only contains data for US as destination.

We then convert the data to long format.

Although the column numbers which contains migration numbers is stated numeric, the commas in between will cause problems when analysing, so we remove those commas.

migrant1<-subset(migrant, destination=="United States of America") #only filter the US Data 
migrant1<-migrant1[,-2] #Removing the US column as this data is all about US as destination
migrant1<-migrant1%>%
  gather(key="country", value="numbers", Afghanistan:Zimbabwe) #convert to long format
migrant1<-na.omit(migrant1) #remove the NAs
str(migrant1)
## 'data.frame':    1050 obs. of  3 variables:
##  $ year   : chr  "1990" "1995" "2000" "2005" ...
##  $ country: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ numbers: chr  "28,444" "37,312" "46,342" "51,965" ...
##  - attr(*, "na.action")= 'omit' Named int  22 23 24 25 26 27 28 29 30 31 ...
##   ..- attr(*, "names")= chr  "22" "23" "24" "25" ...
migrant1$numbers<- as.numeric(gsub(",", "", as.character(migrant1$numbers))) # convert character to numeric so summarization and other analysis may be done.

Analysis and Visualization

1)Top ten places where immigrants came from in the last 19 years to the US

We can see that the top three countries where immigrants came from in the last 19 years are Mexico, China and Puerto Rico.

migrant2<-migrant1%>%
  group_by(country)%>%
  summarize(totalmigrants=sum(numbers))%>%
  top_n(10,totalmigrants)%>%
  arrange(desc(totalmigrants))
migrant2 
## # A tibble: 10 x 2
##    country           totalmigrants
##    <chr>                     <dbl>
##  1 Mexico                 66449196
##  2 China                  12287459
##  3 Puerto Rico            10995652
##  4 Philippines            10941865
##  5 India                  10471258
##  6 Viet Nam                7344249
##  7 Cuba                    7010218
##  8 El Salvador             6925286
##  9 Republic of Korea       6441503
## 10 Canada                  5746699

Have migration patterns changed in the last 5 years?

In the last three years, we see that the top 3 countries whose citizens are migrating to the US are Vietnam, Korea and Guatemala accounting for almost 35% of total migration into the US.

migrant3<-migrant1%>%
  filter(year>=2014 & year<=2019)%>%
  group_by(country)%>%
  summarize(totalmigrants=sum(numbers))%>%
  mutate(percent=totalmigrants/sum(totalmigrants)*100)%>%
  top_n(10,percent)%>%
  arrange(desc(percent))

ggplot(migrant3, aes(x = country, y = percent, fill=country )) +
  geom_col(position = "dodge") +
  ggtitle("US: Top ten countries by immigrants from 2014 to 2019")

2)Which year had the most immigrants moving into the US?

we can see that 2019 had the most number of immigrants into the US, in fact the no of immigrants are increasing every year

migrant4<-migrant1%>%
  group_by(year)%>%
  summarize(totalmigrants=sum(numbers))%>%
  top_n(10,totalmigrants)%>%
  arrange(desc(totalmigrants))

ggplot(migrant4, aes(x = year, y =totalmigrants)) +
  geom_col(position = "dodge") +
  scale_y_discrete("totalmigrants")

Let us plot where the immigrants to US during 1990 to 2019 came from using a map.

world_map <- map_data("world")
names(migrant2)[1]="region"
mig_map <- left_join(world_map, migrant2, by = "region")
ggplot(mig_map, aes(map_id = region, fill = totalmigrants/1000))+
  geom_map(map = mig_map,  color = "white")+
  expand_limits(x = mig_map$long, y = mig_map$lat)+
  ggtitle("Total migrants into the US 1990 to 2019")

Let us do the same for the last 5 years

migrant5<-migrant1%>%
  filter(year>=2014 & year<=2019)%>%
  group_by(country)%>%
  summarize(totalmigrants=sum(numbers))

world<- map_data("world")
names(migrant5)[1]="region"
mig_map1 <- left_join(world_map, migrant5, by = "region")
ggplot(mig_map1, aes(map_id = region, fill = totalmigrants/1000))+
  geom_map(map = mig_map1,  color = "white")+
  expand_limits(x = mig_map1$long, y = mig_map1$lat)+
  ggtitle("Total migrants into the US 2014 to 2019")

Conclusion

Our analysis shows that even though in the last 19 years overall, migration into the US has been dominated by the people from Mexico and surroundings, China, India and Canada, recently in the last 5 years the pattern has emerged to include more countries from Africa, South America and Asia.

A more proper analysis will include identifying trends for every 5 years.