library(knitr)
df <- read.csv('migration.csv', skip = 14, as.is = TRUE, header = FALSE)
my_colnames <- df[1,1:5]
country_colnames <- df[2, 6:ncol(df)]
df <- df[-1:-2,]
names(df) <- c(my_colnames, country_colnames)
df[152, 2] <- 'United Kingdom'
remove <- df[c(1:8, 29, 39, 46, 47, 53, 71, 72, 78, 80:81, 98, 108, 127:128, 139, 153, 170, 180:181, 208, 217, 232, 238:239), 2]
(head(df[,1:8], 10))
## Sort\norder Major area, region, country or area of destination Notes
## 3 1 WORLD
## 4 2 Developed regions (b)
## 5 3 Developing regions (c)
## 6 4 Least developed countries (d)
## 7 5 Less developed regions excluding least developed countries
## 8 6 Sub-Saharan Africa (e)
## 9 7 Africa
## 10 8 Eastern Africa
## 11 9 Burundi
## 12 10 Comoros
## Country code Type of data (a) Total Other North Other South
## 3 900 243 700 236 2 139 539 7 644 005
## 4 901 140 481 955 539 780 3 520 214
## 5 902 103 218 281 1 599 759 4 123 791
## 6 941 11 951 316 241 805 1 005 567
## 7 934 91 266 965 1 357 954 3 118 224
## 8 947 18 993 986 328 171 1 551 645
## 9 903 20 649 557 350 543 1 586 186
## 10 910 6 129 113 102 542 457 137
## 11 108 B R 286 810 3 242 20 599
## 12 174 B 12 555 597 754
The main issue with this dataset is that the country of origin for migration is split into 235 columns. We will need to gather these. There is also an issue with the values having spaces in them making them hard to calculate over. First I shall gather the columns. I will also remove some rows that deal with aggregate regions as we are going to be analyzing movement from the US to other specific countries, and as such having aggregated regions will not be useful to our analysis.
library(tidyverse)
library(stringr)
df <- df %>% filter(!(`Major area, region, country or area of destination` %in% remove))
df <- pivot_longer(df, 6:ncol(df), names_to = 'country_of_origin', values_to = 'count')
head(df, 10)
## # A tibble: 10 x 7
## `Sort\norder` `Major area, re~ Notes `Country code` `Type of data (~
## <chr> <chr> <chr> <chr> <chr>
## 1 9 Burundi "" 108 B R
## 2 9 Burundi "" 108 B R
## 3 9 Burundi "" 108 B R
## 4 9 Burundi "" 108 B R
## 5 9 Burundi "" 108 B R
## 6 9 Burundi "" 108 B R
## 7 9 Burundi "" 108 B R
## 8 9 Burundi "" 108 B R
## 9 9 Burundi "" 108 B R
## 10 9 Burundi "" 108 B R
## # ... with 2 more variables: country_of_origin <chr>, count <chr>
Lets quickly tidy the count column so that it is usable for analysis.
df$count <- as.integer(str_replace_all(df$count, ' ', ''))
head(df, 10)
## # A tibble: 10 x 7
## `Sort\norder` `Major area, re~ Notes `Country code` `Type of data (~
## <chr> <chr> <chr> <chr> <chr>
## 1 9 Burundi "" 108 B R
## 2 9 Burundi "" 108 B R
## 3 9 Burundi "" 108 B R
## 4 9 Burundi "" 108 B R
## 5 9 Burundi "" 108 B R
## 6 9 Burundi "" 108 B R
## 7 9 Burundi "" 108 B R
## 8 9 Burundi "" 108 B R
## 9 9 Burundi "" 108 B R
## 10 9 Burundi "" 108 B R
## # ... with 2 more variables: country_of_origin <chr>, count <int>
The original poster indicated that some of the columns may not be useful to this analysis, so I will remove the columns that seem unnecessary.
df <- df[,c(-1, -3, -5)]
head(df, 10)
## # A tibble: 10 x 4
## `Major area, region, country or area ~ `Country code` country_of_orig~ count
## <chr> <chr> <chr> <int>
## 1 Burundi 108 Total 286810
## 2 Burundi 108 Other North 3242
## 3 Burundi 108 Other South 20599
## 4 Burundi 108 Afghanistan NA
## 5 Burundi 108 Albania NA
## 6 Burundi 108 Algeria NA
## 7 Burundi 108 American Samoa NA
## 8 Burundi 108 Andorra NA
## 9 Burundi 108 Angola NA
## 10 Burundi 108 Anguilla NA
Let us focus in on the United States and see how many immigrants are entering the US.
names(df)[1:2] <- c('country_of_destination', 'country_code')
us_dest <- df %>% filter(country_of_destination == 'United States of America')
sum(us_dest$count, na.rm = TRUE)
## [1] 93254204
top10 <- us_dest %>% arrange(desc(count)) %>% .[2:11,]
top10 %>% ggplot(aes(reorder(country_of_origin, desc(count)), count, fill = country_of_origin)) +
geom_bar(stat = 'identity') + xlab('Origin Country') + theme(axis.text.x = element_text(angle = 90))+
ylab('Number of Immigrants') + ggtitle('Top 10 countries by Immigrants to US')
We see that somewhat unsurprisingly Mexico provides a huge proportion of the immigrants into the United States. The total number of immigrants is approximately 93 million, so Mexico makes up more than 1/8 of the total immigrant population. Now let us flip the script and see where US emmigrants are going to.
us_orig <- df %>% filter(country_of_origin == 'United States of America')
sum(us_orig$count, na.rm = TRUE)
## [1] 3107839
top10 <- us_orig %>% arrange(desc(count)) %>% .[c(1:10),]
top10 %>% ggplot(aes(reorder(country_of_destination, desc(count)), count, fill = country_of_destination)) +
geom_bar(stat = 'identity') + xlab('Destination Country') +
theme(axis.text.x = element_text(angle = 90)) +
ylab('Number of Immigrants') + ggtitle('Top 10 countries by Emmigrants from the US')
Again, Mexico receives the highest number of US emmigrants, although the number is much smaller than the number of immigrants from Mexico. However, the total number of emmigrants is only 3 million, so the proportion of Americans emmigrating to Mexico is almost 1/4 of all total emmigrants. One last analysis I would like to run is to find which countries had the highest difference between immigrants and emmigrants from the US.
diff <- inner_join(us_dest, us_orig, by = c('country_of_origin' = 'country_of_destination'))
diffobj <- diff %>% group_by(country_of_origin) %>% summarize(diff = sum(c(count.x, -count.y), na.rm = TRUE))
top10 <- diffobj %>% arrange(desc(diff)) %>% .[1:10,]
bot10 <- diffobj %>% arrange(diff) %>% .[1:10,]
top10 %>% ggplot(aes(reorder(country_of_origin, desc(diff)), diff, fill = country_of_origin)) +
geom_bar(stat = 'identity') + xlab('Country') +
theme(axis.text.x = element_text(angle = 90)) +
ylab('Difference in Immigrants and Emmigrants') +
ggtitle('Top 10 countries by Difference in Immigration')
bot10 %>% ggplot(aes(reorder(country_of_origin, diff), diff, fill = country_of_origin)) +
geom_bar(stat = 'identity') + xlab('Country') +
theme(axis.text.x = element_text(angle = 90)) +
ylab('Difference in Immigrants and Emmigrants') +
ggtitle('Bottom 10 countries by Difference in Immigration')
Once again somewhat unsuprisingly, we find that Mexico has a large difference in the number of immigrants to the US compared to the number of emmigrants from the US. However, we find some other interesting insights such as the fact that more Americans emmigrate to Australia than immigrate from Australia, and America has high net immigration from many populous Asian countries such as China and India.