Immigration is a hot-button issue in the U.S., particularly in recent years, so I wanted to use United Nations data to learn more about global trends. This data set contains counts of migrants by country/region of origin and country/region of destination as of 2015.
Data source: United Nations, Department of Economic and Social Affairs, Population Division (2015). “Trends in International Migrant Stock: Migrants by Destination and Origin” (United Nations database, POP/DB/MIG/Stock/Rev.2015). Accessed at https://www.un.org/en/development/desa/population/migration/data/estimates2/data/UN_MigrantStockByOriginAndDestination_2015.xlsx
I start by reading in the data and checking its structure.
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
migrant <- read_csv("https://raw.githubusercontent.com/chrosemo/data607_fall19_project2/master/migrant.csv", skip = 14, col_names = TRUE, col_types = cols(.default = 'c'))
head(migrant)
## # A tibble: 6 x 240
## `Sort\norder` `Major area, re~ Notes `Country code` `Type of data (~
## <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 1 WORLD <NA> 900 <NA>
## 3 2 Developed regio~ (b) 901 <NA>
## 4 3 Developing regi~ (c) 902 <NA>
## 5 4 Least developed~ (d) 941 <NA>
## 6 5 Less developed ~ <NA> 934 <NA>
## # ... with 235 more variables: `Country of origin` <chr>, X7 <chr>,
## # X8 <chr>, X9 <chr>, X10 <chr>, X11 <chr>, X12 <chr>, X13 <chr>,
## # X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>, X18 <chr>, X19 <chr>,
## # X20 <chr>, X21 <chr>, X22 <chr>, X23 <chr>, X24 <chr>, X25 <chr>,
## # X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>, X31 <chr>,
## # X32 <chr>, X33 <chr>, X34 <chr>, X35 <chr>, X36 <chr>, X37 <chr>,
## # X38 <chr>, X39 <chr>, X40 <chr>, X41 <chr>, X42 <chr>, X43 <chr>,
## # X44 <chr>, X45 <chr>, X46 <chr>, X47 <chr>, X48 <chr>, X49 <chr>,
## # X50 <chr>, X51 <chr>, X52 <chr>, X53 <chr>, X54 <chr>, X55 <chr>,
## # X56 <chr>, X57 <chr>, X58 <chr>, X59 <chr>, X60 <chr>, X61 <chr>,
## # X62 <chr>, X63 <chr>, X64 <chr>, X65 <chr>, X66 <chr>, X67 <chr>,
## # X68 <chr>, X69 <chr>, X70 <chr>, X71 <chr>, X72 <chr>, X73 <chr>,
## # X74 <chr>, X75 <chr>, X76 <chr>, X77 <chr>, X78 <chr>, X79 <chr>,
## # X80 <chr>, X81 <chr>, X82 <chr>, X83 <chr>, X84 <chr>, X85 <chr>,
## # X86 <chr>, X87 <chr>, X88 <chr>, X89 <chr>, X90 <chr>, X91 <chr>,
## # X92 <chr>, X93 <chr>, X94 <chr>, X95 <chr>, X96 <chr>, X97 <chr>,
## # X98 <chr>, X99 <chr>, X100 <chr>, X101 <chr>, X102 <chr>, X103 <chr>,
## # X104 <chr>, X105 <chr>, ...
The CSV file has column names split across two rows, so I update the data frame’s first row and make its values the column names. After deleting the now redundant first row, I remove the blank spaces from the columns containing counts and then convert those columns to numeric format.
migrant[1, 1:5] = c('Sort.order','Destination', 'Notes', 'Code', 'Data.type')
colnames(migrant) <- as.character(migrant[1,])
migrant <- migrant[-1,]
migrant[6:240] <- lapply(migrant[6:240], function(y) as.numeric(gsub('\\s{1,2}', '', y)))
head(migrant)
## # A tibble: 6 x 240
## Sort.order Destination Notes Code Data.type Total `Other North`
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 WORLD <NA> 900 <NA> 2.44e8 2139539
## 2 2 Developed ~ (b) 901 <NA> 1.40e8 539780
## 3 3 Developing~ (c) 902 <NA> 1.03e8 1599759
## 4 4 Least deve~ (d) 941 <NA> 1.20e7 241805
## 5 5 Less devel~ <NA> 934 <NA> 9.13e7 1357954
## 6 6 Sub-Sahara~ (e) 947 <NA> 1.90e7 328171
## # ... with 233 more variables: `Other South` <dbl>, Afghanistan <dbl>,
## # Albania <dbl>, Algeria <dbl>, `American Samoa` <dbl>, Andorra <dbl>,
## # Angola <dbl>, Anguilla <dbl>, `Antigua and Barbuda` <dbl>,
## # Argentina <dbl>, Armenia <dbl>, Aruba <dbl>, Australia <dbl>,
## # Austria <dbl>, Azerbaijan <dbl>, Bahamas <dbl>, Bahrain <dbl>,
## # Bangladesh <dbl>, Barbados <dbl>, Belarus <dbl>, Belgium <dbl>,
## # Belize <dbl>, Benin <dbl>, Bermuda <dbl>, Bhutan <dbl>, `Bolivia
## # (Plurinational State of)` <dbl>, `Bonaire, Sint Eustatius and
## # Saba` <dbl>, `Bosnia and Herzegovina` <dbl>, Botswana <dbl>,
## # Brazil <dbl>, `British Virgin Islands` <dbl>, `Brunei
## # Darussalam` <dbl>, Bulgaria <dbl>, `Burkina Faso` <dbl>,
## # Burundi <dbl>, `Cabo Verde` <dbl>, Cambodia <dbl>, Cameroon <dbl>,
## # Canada <dbl>, `Cayman Islands` <dbl>, `Central African
## # Republic` <dbl>, Chad <dbl>, `Channel Islands` <dbl>, Chile <dbl>,
## # China <dbl>, `China, Hong Kong Special Administrative Region` <dbl>,
## # `China, Macao Special Administrative Region` <dbl>, Colombia <dbl>,
## # Comoros <dbl>, Congo <dbl>, `Cook Islands` <dbl>, `Costa Rica` <dbl>,
## # `C\xf4te d'Ivoire` <dbl>, Croatia <dbl>, Cuba <dbl>,
## # `Cura\xe7ao` <dbl>, Cyprus <dbl>, `Czech Republic` <dbl>, `Democratic
## # People's Republic of Korea` <dbl>, `Democratic Republic of the
## # Congo` <dbl>, Denmark <dbl>, Djibouti <dbl>, Dominica <dbl>,
## # `Dominican Republic` <dbl>, Ecuador <dbl>, Egypt <dbl>, `El
## # Salvador` <dbl>, `Equatorial Guinea` <dbl>, Eritrea <dbl>,
## # Estonia <dbl>, Ethiopia <dbl>, `Faeroe Islands` <dbl>, `Falkland
## # Islands (Malvinas)` <dbl>, Fiji <dbl>, Finland <dbl>, France <dbl>,
## # `French Guiana` <dbl>, `French Polynesia` <dbl>, Gabon <dbl>,
## # Gambia <dbl>, Georgia <dbl>, Germany <dbl>, Ghana <dbl>,
## # Gibraltar <dbl>, Greece <dbl>, Greenland <dbl>, Grenada <dbl>,
## # Guadeloupe <dbl>, Guam <dbl>, Guatemala <dbl>, Guinea <dbl>,
## # `Guinea-Bissau` <dbl>, Guyana <dbl>, Haiti <dbl>, `Holy See` <dbl>,
## # Honduras <dbl>, Hungary <dbl>, Iceland <dbl>, India <dbl>,
## # Indonesia <dbl>, ...
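As a small illustration of the cleanup described above, here is a base R sketch on made-up values. It assumes the counts arrive as strings with spaces as thousands separators, and it assumes (based only on the escaped bytes in names such as `C\xf4te d'Ivoire` in the printed output) that the file is Latin-1 encoded. Neither assumption is confirmed by the data source.

```r
# Toy stand-in for the count columns: strings with spaces as thousands
# separators (made-up values, not taken from the UN file).
raw_counts <- c("2 139 539", "539 780", NA)

# Strip every run of whitespace, then convert; NA stays NA.
counts <- as.numeric(gsub("\\s+", "", raw_counts))
print(counts)

# Assumption: the escaped byte in this name is Latin-1 (0xf4 is o-circumflex).
# Declaring that encoding while converting to UTF-8 restores the accent.
raw_name <- "C\xf4te d'Ivoire"
fixed_name <- iconv(raw_name, from = "latin1", to = "UTF-8")
print(fixed_name)
```

If the Latin-1 assumption holds, passing `locale = locale(encoding = "latin1")` to `read_csv()` would likely avoid the escaped column names at read time.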
Next, I drop all rows that represent global or regional geographies (rows with missing values for “Data.type”), remove columns not needed for my analysis, reshape the data to long format, filter out the resulting rows where destination and origin are the same (they carry no information about migration), and arrange by destination. I finish by converting the origin count column to numeric format to facilitate analysis.
migrant <- migrant %>%
  drop_na(Data.type) %>%
  select(-c(1, 3, 4, 5)) %>%
  gather(Origin, Origin.count, -Destination) %>%
  filter(Destination != Origin) %>%
  arrange(Destination)
migrant$Origin.count <- as.numeric(migrant$Origin.count)
head(migrant)
## # A tibble: 6 x 3
## Destination Origin Origin.count
## <chr> <chr> <dbl>
## 1 Afghanistan Total 382365
## 2 Afghanistan Other North 4554
## 3 Afghanistan Other South 13660
## 4 Afghanistan Albania NA
## 5 Afghanistan Algeria NA
## 6 Afghanistan American Samoa NA
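Newer tidyr releases recommend `pivot_longer()` over `gather()`, so the reshape step could also be sketched as below. The toy table and its values are invented for illustration. Note that `pivot_longer()` returns rows in a different order than `gather()` (row by row rather than column by column), so the order within each destination may differ from the output above.

```r
library(tidyr)
library(dplyr)

# Made-up wide table: one row per destination, one column per origin.
wide <- tibble(
  Destination = c("A", "B"),
  A = c(NA, 10),
  B = c(5, NA),
  C = c(7, 3)
)

# Reshape to one row per destination/origin pair, drop self-pairs,
# and sort by destination, mirroring the pipeline above.
long <- wide %>%
  pivot_longer(-Destination, names_to = "Origin", values_to = "Origin.count") %>%
  filter(Destination != Origin) %>%
  arrange(Destination)

print(long)
```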
One of my classmates provided a set of general questions for analysis, though not all apply directly to this 2015 gender-neutral data set. I address the applicable ones below.
Worldwide, as of 2015, an estimated 243,700,236 individuals had migrated from one UN-recognized geography to another. Among countries of origin, India has the most emigrants, with an estimated 15,575,724, followed by Mexico (an estimated 12,339,062) and the Russian Federation (an estimated 10,576,766).
The Holy See has the fewest emigrants, with an estimated 182, followed by Saint Pierre and Miquelon (an estimated 435) and the Falkland Islands (Malvinas) (an estimated 1,124).
Total_by_origin <- migrant %>% group_by(Origin) %>% summarise(Origin.total = sum(Origin.count, na.rm=TRUE)) %>% arrange(-Origin.total)
head(Total_by_origin)
## # A tibble: 6 x 2
## Origin Origin.total
## <chr> <dbl>
## 1 Total 243700236
## 2 India 15575724
## 3 Mexico 12339062
## 4 Russian Federation 10576766
## 5 China 9546065
## 6 Other South 7644005
tail(Total_by_origin)
## # A tibble: 6 x 2
## Origin Origin.total
## <chr> <dbl>
## 1 Turks and Caicos Islands 1878
## 2 Cayman Islands 1569
## 3 French Polynesia 1337
## 4 Falkland Islands (Malvinas) 1124
## 5 Saint Pierre and Miquelon 435
## 6 Holy See 182
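Because the grouped totals include aggregate origins such as "Total" and "Other South", the ranking has to be read with those rows skipped. One option is to drop them before summarising, sketched here on invented data. The `aggregate_labels` vector is my own assumption about which origin labels are aggregates, based only on the labels visible in the output above.

```r
library(dplyr)

# Made-up long table in the same shape as `migrant`.
toy_long <- tibble(
  Destination  = c("A", "A", "A", "B", "B", "B"),
  Origin       = c("Total", "C", "D", "Total", "C", "D"),
  Origin.count = c(12, 7, 5, 3, 3, NA)
)

# Assumed labels for aggregate (non-country) origins.
aggregate_labels <- c("Total", "Other North", "Other South")

# Sum counts per origin, ignoring missing values and aggregate rows.
by_origin <- toy_long %>%
  filter(!Origin %in% aggregate_labels) %>%
  group_by(Origin) %>%
  summarise(Origin.total = sum(Origin.count, na.rm = TRUE)) %>%
  arrange(desc(Origin.total))

print(by_origin)
```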
Globally, the United States has the most immigrants as of 2015, with an estimated total of 46,627,102, followed by Germany (an estimated 12,005,690) and the Russian Federation (an estimated 11,643,276). The geographies with the fewest immigrants as of 2015 are Tuvalu (an estimated 141), Tokelau (an estimated 487), and Niue (an estimated 557).
migrant_totals <- migrant %>% filter(Origin == 'Total') %>% arrange(-Origin.count)
head(migrant_totals)
## # A tibble: 6 x 3
## Destination Origin Origin.count
## <chr> <chr> <dbl>
## 1 United States of America Total 46627102
## 2 Germany Total 12005690
## 3 Russian Federation Total 11643276
## 4 Saudi Arabia Total 10185945
## 5 United Kingdom of Great Britain and Northern Ireland Total 8543120
## 6 United Arab Emirates Total 8095126
tail(migrant_totals)
## # A tibble: 6 x 3
## Destination Origin Origin.count
## <chr> <chr> <dbl>
## 1 Saint Pierre and Miquelon Total 986
## 2 Holy See Total 800
## 3 Saint Helena Total 604
## 4 Niue Total 557
## 5 Tokelau Total 487
## 6 Tuvalu Total 141
The median number of immigrants across all countries/geographies is an estimated 149,726, with a 25th percentile of an estimated 28,080 and a 75th percentile of an estimated 704,272. The mean (an estimated 1,050,432) is roughly seven times the median, and a quick histogram, with the mean marked by a dashed line, shows that the distribution of immigrants by country is extremely right-skewed, much like the distribution of total population by country.
summary(migrant_totals$Origin.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 141 28080 149726 1050432 704272 46627102
ggplot(data=migrant_totals, mapping=aes(x=Origin.count)) +
geom_histogram(color="black", fill="white", bins=50) +
geom_vline(data=migrant_totals, aes(xintercept=mean(Origin.count)), linetype="dashed")
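With a distribution this skewed, most bars pile up near zero on a linear axis. One possible alternative, sketched here on invented values rather than the real totals, is a log-scaled x axis via `scale_x_log10()`. The gap between mean and median is a quick numeric check of the skew.

```r
library(ggplot2)

# Invented right-skewed counts standing in for migrant_totals$Origin.count.
toy <- data.frame(
  Origin.count = c(150, 900, 30000, 150000, 700000, 45000000)
)

# In right-skewed data the mean sits well above the median.
skew_gap <- mean(toy$Origin.count) - median(toy$Origin.count)
print(skew_gap)

# Same histogram idea as above, but on a log10 x axis so that small and
# large geographies are both visible; the dashed line marks the median.
p <- ggplot(toy, aes(x = Origin.count)) +
  geom_histogram(color = "black", fill = "white", bins = 10) +
  geom_vline(xintercept = median(toy$Origin.count), linetype = "dashed") +
  scale_x_log10()
```

Printing `p` draws the plot. Whether a log axis suits the final write-up is a judgment call, since it changes how bar heights should be read.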
Excluding regional geographies, Mexico (an estimated 12,050,031), China (an estimated 2,103,551), and India (an estimated 1,969,286) have sent the most emigrants to the United States as of 2015. By contrast, among origins with reported data, Turkmenistan has sent the fewest, with an estimated 2,079. Counts are missing for 81 origin countries/geographies in this data set.
USA_emigrants <- migrant %>% filter(Destination == 'United States of America') %>% arrange(-Origin.count)
head(USA_emigrants)
## # A tibble: 6 x 3
## Destination Origin Origin.count
## <chr> <chr> <dbl>
## 1 United States of America Total 46627102
## 2 United States of America Mexico 12050031
## 3 United States of America Other South 3068367
## 4 United States of America China 2103551
## 5 United States of America India 1969286
## 6 United States of America Philippines 1896031
tail(USA_emigrants)
## # A tibble: 6 x 3
## Destination Origin Origin.count
## <chr> <chr> <dbl>
## 1 United States of America Turks and Caicos Islands NA
## 2 United States of America Tuvalu NA
## 3 United States of America United States Virgin Islands NA
## 4 United States of America Vanuatu NA
## 5 United States of America Wallis and Futuna Islands NA
## 6 United States of America Western Sahara NA
summary(USA_emigrants$Origin.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2079 25976 63255 609505 197171 46627102 81
USA_emigrants %>% filter(Origin.count == min(Origin.count, na.rm = TRUE))
## # A tibble: 1 x 3
## Destination Origin Origin.count
## <chr> <chr> <dbl>
## 1 United States of America Turkmenistan 2079
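Missing counts need some care when looking up the minimum: in base R, a logical comparison against `NA` yields `NA`, and subsetting with it returns extra all-`NA` rows. A minimal sketch on invented data, using `which()` to keep only the rows where the comparison is actually `TRUE`:

```r
# Invented origins and counts, two of them missing.
toy <- data.frame(
  Origin       = c("P", "Q", "R", "S"),
  Origin.count = c(50, 2, NA, NA),
  stringsAsFactors = FALSE
)

# Number of origins with no reported count.
n_missing <- sum(is.na(toy$Origin.count))

# which() drops the NA comparisons, so only the true minimum row is kept.
min_count <- min(toy$Origin.count, na.rm = TRUE)
smallest <- toy[which(toy$Origin.count == min_count), ]

print(n_missing)
print(smallest)
```

`dplyr::filter()` behaves like the `which()` version here, since it drops rows where the condition evaluates to `NA`.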