Data Summary

Immigration is a hot-button issue in the U.S., particularly recently, so I wanted to use United Nations data to learn more about global trends. This data set contains counts of migrants by country/region of origin and country/region of destination as of 2015.

Data source: United Nations, Department of Economic and Social Affairs, Population Division (2015). “Trends in International Migrant Stock: Migrants by Destination and Origin” (United Nations database, POP/DB/MIG/Stock/Rev.2015). Accessed at https://www.un.org/en/development/desa/population/migration/data/estimates2/data/UN_MigrantStockByOriginAndDestination_2015.xlsx


Loading the data

I start by reading in the data and checking its structure.

library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
migrant <- read_csv("https://raw.githubusercontent.com/chrosemo/data607_fall19_project2/master/migrant.csv", skip=14, col_names = TRUE, col_types = cols (.default='c'))
head(migrant)
## # A tibble: 6 x 240
##   `Sort\norder` `Major area, re~ Notes `Country code` `Type of data (~
##   <chr>         <chr>            <chr> <chr>          <chr>           
## 1 <NA>          <NA>             <NA>  <NA>           <NA>            
## 2 1             WORLD            <NA>  900            <NA>            
## 3 2             Developed regio~ (b)   901            <NA>            
## 4 3             Developing regi~ (c)   902            <NA>            
## 5 4             Least developed~ (d)   941            <NA>            
## 6 5             Less developed ~ <NA>  934            <NA>            
## # ... with 235 more variables: `Country of origin` <chr>, X7 <chr>,
## #   X8 <chr>, X9 <chr>, X10 <chr>, X11 <chr>, X12 <chr>, X13 <chr>,
## #   X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>, X18 <chr>, X19 <chr>,
## #   X20 <chr>, X21 <chr>, X22 <chr>, X23 <chr>, X24 <chr>, X25 <chr>,
## #   X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>, X31 <chr>,
## #   X32 <chr>, X33 <chr>, X34 <chr>, X35 <chr>, X36 <chr>, X37 <chr>,
## #   X38 <chr>, X39 <chr>, X40 <chr>, X41 <chr>, X42 <chr>, X43 <chr>,
## #   X44 <chr>, X45 <chr>, X46 <chr>, X47 <chr>, X48 <chr>, X49 <chr>,
## #   X50 <chr>, X51 <chr>, X52 <chr>, X53 <chr>, X54 <chr>, X55 <chr>,
## #   X56 <chr>, X57 <chr>, X58 <chr>, X59 <chr>, X60 <chr>, X61 <chr>,
## #   X62 <chr>, X63 <chr>, X64 <chr>, X65 <chr>, X66 <chr>, X67 <chr>,
## #   X68 <chr>, X69 <chr>, X70 <chr>, X71 <chr>, X72 <chr>, X73 <chr>,
## #   X74 <chr>, X75 <chr>, X76 <chr>, X77 <chr>, X78 <chr>, X79 <chr>,
## #   X80 <chr>, X81 <chr>, X82 <chr>, X83 <chr>, X84 <chr>, X85 <chr>,
## #   X86 <chr>, X87 <chr>, X88 <chr>, X89 <chr>, X90 <chr>, X91 <chr>,
## #   X92 <chr>, X93 <chr>, X94 <chr>, X95 <chr>, X96 <chr>, X97 <chr>,
## #   X98 <chr>, X99 <chr>, X100 <chr>, X101 <chr>, X102 <chr>, X103 <chr>,
## #   X104 <chr>, X105 <chr>, ...
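
The two reading options used above, skipping preamble lines and reading every column as character, can be tried on a small in-memory CSV. This is only a sketch under stated assumptions: it uses base R's `read.csv` instead of `read_csv` so that it needs no packages, and `csv_text`, `toy_raw`, and the "Country A/B/C" values are made up for illustration. Reading as character presumably matters here because the counts contain embedded spaces, which are stripped in the tidying step.

```r
# Made-up CSV text: two preamble lines, then a header row and two data rows.
csv_text <- "Title line to skip
Another preamble line
Destination,Total,Country A
Country B,1 234,56
Country C,12 345,
"

# skip = 2 drops the preamble; colClasses = "character" keeps every cell as text,
# so counts written with embedded spaces survive untouched until they are cleaned.
toy_raw <- read.csv(text = csv_text, skip = 2, colClasses = "character",
                    check.names = FALSE, na.strings = "")

str(toy_raw)
```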


Tidying the data

The CSV file has column names split across two rows, so I overwrite the first five cells of the data frame’s first row with short names and then promote that row to the column names. After deleting the now-redundant first row, I strip the blank spaces from the columns containing counts and convert those columns to numeric format.

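# Overwrite the first five header cells with short names
migrant[1, 1:5] <- c('Sort.order', 'Destination', 'Notes', 'Code', 'Data.type')
# Promote the first row to column names, then drop it
colnames(migrant) <- as.character(migrant[1,])
migrant <- migrant[-1,]
# Strip whitespace from the count columns and convert them to numeric
migrant[6:240] <- lapply(migrant[6:240], function(y) as.numeric(gsub('\\s+', '', y)))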
head(migrant)
## # A tibble: 6 x 240
##   Sort.order Destination Notes Code  Data.type  Total `Other North`
##   <chr>      <chr>       <chr> <chr> <chr>      <dbl>         <dbl>
## 1 1          WORLD       <NA>  900   <NA>      2.44e8       2139539
## 2 2          Developed ~ (b)   901   <NA>      1.40e8        539780
## 3 3          Developing~ (c)   902   <NA>      1.03e8       1599759
## 4 4          Least deve~ (d)   941   <NA>      1.20e7        241805
## 5 5          Less devel~ <NA>  934   <NA>      9.13e7       1357954
## 6 6          Sub-Sahara~ (e)   947   <NA>      1.90e7        328171
## # ... with 233 more variables: `Other South` <dbl>, Afghanistan <dbl>,
## #   Albania <dbl>, Algeria <dbl>, `American Samoa` <dbl>, Andorra <dbl>,
## #   Angola <dbl>, Anguilla <dbl>, `Antigua and Barbuda` <dbl>,
## #   Argentina <dbl>, Armenia <dbl>, Aruba <dbl>, Australia <dbl>,
## #   Austria <dbl>, Azerbaijan <dbl>, Bahamas <dbl>, Bahrain <dbl>,
## #   Bangladesh <dbl>, Barbados <dbl>, Belarus <dbl>, Belgium <dbl>,
## #   Belize <dbl>, Benin <dbl>, Bermuda <dbl>, Bhutan <dbl>, `Bolivia
## #   (Plurinational State of)` <dbl>, `Bonaire, Sint Eustatius and
## #   Saba` <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>,
## #   Brazil <dbl>, `British Virgin Islands` <dbl>, `Brunei
## #   Darussalam` <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>,
## #   Burundi <dbl>, `Cabo Verde` <dbl>, Cambodia <dbl>, Cameroon <dbl>,
## #   Canada <dbl>, `Cayman Islands` <dbl>, `Central African
## #   Republic` <dbl>, Chad <dbl>, `Channel Islands` <dbl>, Chile <dbl>,
## #   China <dbl>, `China, Hong Kong Special Administrative Region` <dbl>,
## #   `China, Macao Special Administrative Region` <dbl>, Colombia <dbl>,
## #   Comoros <dbl>, Congo <dbl>, `Cook Islands` <dbl>, `Costa Rica` <dbl>,
## #   `C\xf4te d'Ivoire` <dbl>, Croatia <dbl>, Cuba <dbl>,
## #   `Cura\xe7ao` <dbl>, Cyprus <dbl>, `Czech Republic` <dbl>, `Democratic
## #   People's Republic of Korea` <dbl>, `Democratic Republic of the
## #   Congo` <dbl>, Denmark <dbl>, Djibouti <dbl>, Dominica <dbl>,
## #   `Dominican Republic` <dbl>, Ecuador <dbl>, Egypt <dbl>, `El
## #   Salvador` <dbl>, `Equatorial Guinea` <dbl>, Eritrea <dbl>,
## #   Estonia <dbl>, Ethiopia <dbl>, `Faeroe Islands` <dbl>, `Falkland
## #   Islands (Malvinas)` <dbl>, Fiji <dbl>, Finland <dbl>, France <dbl>,
## #   `French Guiana` <dbl>, `French Polynesia` <dbl>, Gabon <dbl>,
## #   Gambia <dbl>, Georgia <dbl>, Germany <dbl>, Ghana <dbl>,
## #   Gibraltar <dbl>, Greece <dbl>, Greenland <dbl>, Grenada <dbl>,
## #   Guadeloupe <dbl>, Guam <dbl>, Guatemala <dbl>, Guinea <dbl>,
## #   `Guinea-Bissau` <dbl>, Guyana <dbl>, Haiti <dbl>, `Holy See` <dbl>,
## #   Honduras <dbl>, Hungary <dbl>, Iceland <dbl>, India <dbl>,
## #   Indonesia <dbl>, ...
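
The header-promotion and whitespace-stripping steps can be checked on a toy data frame. This is a minimal base R sketch with made-up values; `raw`, `toy`, `count_cols`, and the "Country A/B/C" labels are hypothetical and not part of the UN data.

```r
# Toy stand-in for the raw file: every column is character, and the real
# column names sit in the first data row.
raw <- data.frame(
  X1 = c("Destination", "Country A", "Country B"),
  X2 = c("Total", "1 234", "12 345 678"),
  X3 = c("Country C", "56", NA),
  stringsAsFactors = FALSE
)

# Promote the first row to column names, then drop it.
colnames(raw) <- as.character(unlist(raw[1, ]))
toy <- raw[-1, ]
rownames(toy) <- NULL

# Remove all whitespace from the count columns and convert them to numeric.
count_cols <- setdiff(names(toy), "Destination")
toy[count_cols] <- lapply(toy[count_cols], function(y) as.numeric(gsub("\\s+", "", y)))

str(toy)
```

Missing cells stay `NA` through both `gsub()` and `as.numeric()`, so no counts are invented along the way.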


Next, I drop all rows that represent global or regional geographies (rows with a missing value for “Data.type”), remove the columns I do not need for my analysis, reshape the data to long format, filter out the resulting rows where destination and origin are the same (they carry no information for this analysis), and arrange by destination. I finish by making sure the origin count column is numeric to facilitate analysis.

migrant <- migrant %>%
  drop_na(Data.type) %>%                          # keep country-level rows only
  select(-c(1,3,4,5)) %>%                         # drop Sort.order, Notes, Code, Data.type
  gather(Origin, Origin.count, -Destination) %>%  # wide to long
  filter(Destination != Origin) %>%
  arrange(Destination)
migrant$Origin.count <- as.numeric(migrant$Origin.count)
head(migrant)
## # A tibble: 6 x 3
##   Destination Origin         Origin.count
##   <chr>       <chr>                 <dbl>
## 1 Afghanistan Total                382365
## 2 Afghanistan Other North            4554
## 3 Afghanistan Other South           13660
## 4 Afghanistan Albania                  NA
## 5 Afghanistan Algeria                  NA
## 6 Afghanistan American Samoa           NA
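
The wide-to-long reshape can be seen on a two-country toy table. The sketch below uses `pivot_longer()`, which tidyr documents as the replacement for `gather()`; it assumes tidyr 1.0.0 or later is installed, and `wide`, `long`, and the counts are made up for illustration.

```r
library(tidyr)
library(dplyr)

# Toy wide table: one row per destination, one column per origin (made-up counts).
wide <- tibble(
  Destination = c("Country A", "Country B"),
  Total       = c(30, 5),
  `Country A` = c(NA, 5),
  `Country B` = c(30, NA)
)

# One row per destination/origin pair, dropping pairs where both are the same place.
long <- wide %>%
  pivot_longer(cols = -Destination, names_to = "Origin", values_to = "Origin.count") %>%
  filter(Destination != Origin) %>%
  arrange(Destination)

long
```

Each destination keeps its "Total" row, which is why the later steps can filter on `Origin == 'Total'`.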


Analyzing the data

One of my classmates provided a set of general questions for analysis, though not all of them apply directly to this data set, which covers only 2015 and is not broken down by gender. The questions are the following:

Which country/geography has had the most emigrants as of 2015? The fewest emigrants?

Which country/geography has had the most immigrants as of 2015? The fewest immigrants? The median number?

Which country/geography has sent the most emigrants to the United States as of 2015? The fewest emigrants?


Which country/geography has had the most emigrants as of 2015? The fewest emigrants?

Worldwide, as of 2015, an estimated 243,700,236 individuals had migrated from one UN-recognized geography to another (the “Total” row in the output below). Among countries of origin, India has had the most emigrants, with an estimated 15,575,724, followed by Mexico (an estimated 12,339,062) and the Russian Federation (an estimated 10,576,766).

At the other end, the Holy See has had the fewest emigrants (an estimated 182), followed by Saint Pierre and Miquelon (an estimated 435) and the Falkland Islands (Malvinas) (an estimated 1,124).

Total_by_origin <- migrant %>% group_by(Origin) %>% summarise(Origin.total = sum(Origin.count, na.rm=TRUE)) %>% arrange(-Origin.total)
head(Total_by_origin)
## # A tibble: 6 x 2
##   Origin             Origin.total
##   <chr>                     <dbl>
## 1 Total                 243700236
## 2 India                  15575724
## 3 Mexico                 12339062
## 4 Russian Federation     10576766
## 5 China                   9546065
## 6 Other South             7644005
tail(Total_by_origin)
## # A tibble: 6 x 2
##   Origin                      Origin.total
##   <chr>                              <dbl>
## 1 Turks and Caicos Islands            1878
## 2 Cayman Islands                      1569
## 3 French Polynesia                    1337
## 4 Falkland Islands (Malvinas)         1124
## 5 Saint Pierre and Miquelon            435
## 6 Holy See                             182
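
The ranking above still contains aggregate labels ("Total" and "Other South" appear among the top rows), so they have to be skipped by eye. One way to drop them before ranking is sketched below on made-up counts; `toy`, `aggregate_labels`, and `by_origin` are names invented for the illustration, and treating "Total", "Other North", and "Other South" as the only aggregate labels is an assumption based on the columns shown earlier.

```r
library(dplyr)

# Toy long-format data with made-up counts.
toy <- tibble(
  Destination  = c("Country A", "Country A", "Country A",
                   "Country B", "Country B", "Country B"),
  Origin       = c("Total", "Other South", "Country B",
                   "Total", "Other South", "Country A"),
  Origin.count = c(120, 20, 100, 15, NA, 15)
)

# Assumed aggregate labels; adjust if the data set contains others.
aggregate_labels <- c("Total", "Other North", "Other South")

by_origin <- toy %>%
  filter(!Origin %in% aggregate_labels) %>%
  group_by(Origin) %>%
  summarise(Origin.total = sum(Origin.count, na.rm = TRUE)) %>%
  arrange(desc(Origin.total))

by_origin
```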


Which country/geography has had the most immigrants as of 2015? The fewest immigrants? The median number?

Globally, the United States has had the most immigrants as of 2015, with an estimated total of 46,627,102, followed by Germany (an estimated 12,005,690), and the Russian Federation (an estimated 11,643,276). The geographies with the fewest immigrants as of 2015 are Tuvalu (an estimated 141), Tokelau (an estimated 487), and Niue (an estimated 557).

migrant_totals <- migrant %>% filter(Origin == 'Total') %>% arrange(-Origin.count)
head(migrant_totals)
## # A tibble: 6 x 3
##   Destination                                          Origin Origin.count
##   <chr>                                                <chr>         <dbl>
## 1 United States of America                             Total      46627102
## 2 Germany                                              Total      12005690
## 3 Russian Federation                                   Total      11643276
## 4 Saudi Arabia                                         Total      10185945
## 5 United Kingdom of Great Britain and Northern Ireland Total       8543120
## 6 United Arab Emirates                                 Total       8095126
tail(migrant_totals)
## # A tibble: 6 x 3
##   Destination               Origin Origin.count
##   <chr>                     <chr>         <dbl>
## 1 Saint Pierre and Miquelon Total           986
## 2 Holy See                  Total           800
## 3 Saint Helena              Total           604
## 4 Niue                      Total           557
## 5 Tokelau                   Total           487
## 6 Tuvalu                    Total           141


The median number of immigrants across all countries/geographies is an estimated 149,726, with a 25th percentile of an estimated 28,080 and a 75th percentile of an estimated 704,272. The mean (an estimated 1,050,432) is roughly seven times the median, and a quick histogram, with the mean marked by a dashed line, shows that the distribution of immigrants by country is extremely right-skewed, much like the distribution of total population by country.

summary(migrant_totals$Origin.count)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      141    28080   149726  1050432   704272 46627102
ggplot(data=migrant_totals, mapping=aes(x=Origin.count)) +
  geom_histogram(color="black", fill="white", bins=50) +
  geom_vline(data=migrant_totals, aes(xintercept=mean(Origin.count)), linetype="dashed")
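
Because the counts span several orders of magnitude, a log-scaled x axis can make the shape of the distribution easier to read. The following is a minimal sketch on simulated counts, not the UN data; `sim`, `med`, and `p` are hypothetical names, and marking the median instead of the mean is a choice made for this sketch.

```r
library(ggplot2)

set.seed(607)
# Simulated right-skewed counts standing in for migrant_totals$Origin.count.
sim <- data.frame(Origin.count = round(rlnorm(200, meanlog = 12, sdlog = 2)) + 1)
med <- median(sim$Origin.count)

# Same histogram as above, but on a log10 x axis with the median marked.
p <- ggplot(sim, aes(x = Origin.count)) +
  geom_histogram(color = "black", fill = "white", bins = 50) +
  scale_x_log10() +
  geom_vline(xintercept = med, linetype = "dashed")
p
```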


Which country/geography has sent the most emigrants to the United States as of 2015? The fewest emigrants?

Excluding regional geographies, Mexico (an estimated 12,050,031), China (an estimated 2,103,551), and India (an estimated 1,969,286) have sent the most emigrants to the United States as of 2015. By contrast, among origins with a recorded count, Turkmenistan has sent the fewest, with an estimated 2,079. Eighty-one countries/geographies of origin have no count recorded for the United States in this data set.

USA_emigrants <- migrant %>% filter(Destination == 'United States of America') %>% arrange(-Origin.count)
head(USA_emigrants)
## # A tibble: 6 x 3
##   Destination              Origin      Origin.count
##   <chr>                    <chr>              <dbl>
## 1 United States of America Total           46627102
## 2 United States of America Mexico          12050031
## 3 United States of America Other South      3068367
## 4 United States of America China            2103551
## 5 United States of America India            1969286
## 6 United States of America Philippines      1896031
tail(USA_emigrants)
## # A tibble: 6 x 3
##   Destination              Origin                       Origin.count
##   <chr>                    <chr>                               <dbl>
## 1 United States of America Turks and Caicos Islands               NA
## 2 United States of America Tuvalu                                 NA
## 3 United States of America United States Virgin Islands           NA
## 4 United States of America Vanuatu                                NA
## 5 United States of America Wallis and Futuna Islands              NA
## 6 United States of America Western Sahara                         NA
summary(USA_emigrants$Origin.count)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     2079    25976    63255   609505   197171 46627102       81
head(USA_emigrants[USA_emigrants$Origin.count == 2079,],1)
## # A tibble: 1 x 3
##   Destination              Origin       Origin.count
##   <chr>                    <chr>               <dbl>
## 1 United States of America Turkmenistan         2079
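
The lookup above matches on the minimum value read off the summary. An alternative that does not hard-code the number is `which.min()`, which ignores `NA` values. The sketch below runs on made-up counts; `toy_usa`, `origins_only`, `smallest`, and `n_missing` are names invented for the illustration.

```r
# Toy stand-in for USA_emigrants (made-up counts).
toy_usa <- data.frame(
  Origin       = c("Total", "Country A", "Country B", "Country C"),
  Origin.count = c(1500, 1000, 500, NA),
  stringsAsFactors = FALSE
)

# Drop the aggregate labels so only individual origins are compared.
origins_only <- toy_usa[!toy_usa$Origin %in% c("Total", "Other North", "Other South"), ]

# which.min() skips NA, so the smallest recorded count is found directly.
smallest  <- origins_only[which.min(origins_only$Origin.count), ]
# Count the origins with no recorded value.
n_missing <- sum(is.na(origins_only$Origin.count))

smallest
n_missing
```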