library("tidyverse")
library("ggplot2")
require(maps)
Dataset 1 UN Migration, Courtesy of Salma Elshahawy
For this project we are required to use 3 wide datasets to practice tidying,cleaning and analysis of data. We have been asked to creat csv files from the datasets and import into R. Then we are required to use tidyr and dplyr to clean and analyse the data.
This dataset is taken from the UN site at https://www.un.org/en/development/desa/population/migration/data/estimates2/estimates15.asp.
The dataset contains estimated of international migration by age, sex and origin.We worked on the dataset that contains data by destination and origin and tried to answer the following questions from Salma:
The data has a lot of empty fields, so it required converting the empties to NAs while importing so that cleaning would be easier using tidyr
migrant <- read.csv("https://raw.githubusercontent.com/zahirf/Data607/master/UN_Migrant.csv", sep = ",", header = FALSE, stringsAsFactors=FALSE, na.strings = c("", "NA")) #import converting all empties to NA
There is a lot of cleaning needed with this one. The steps are outlined below:
colnames(migrant) <- migrant[16, ] #Need to use the country names in row 16 as headers automatically, it is too much to do manually
migrant<-migrant[,-(4:9)] #removing columns we won't need
migrant<-migrant[,-2]
colnames(migrant)[1:2] <- c("year","destination") #headers for the missing fields
migrant<-drop_na(migrant,year) #drop all rows with NA in the year column as those are just blank rows, rows 1 to 15 had information about the dataset and needed to be dropped.
migrant<-migrant%>%
select_if(~sum(!is.na(.)) > 0) #there were many extra blank columns after Zimbabwe that needed to be removed
migrant <- migrant[-1 ,] #remove duplicate column for country names
head(migrant,2) #check the first 2 rows
## year destination Afghanistan Albania Algeria American Samoa
## 17 1990 WORLD 6,823,350 180,284 921,727 2,041
## 18 1990 UN development groups .. .. .. ..
## Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia Aruba
## 17 3,792 824,942 2,047 21,753 430,169 899,649 10,596
## 18 .. .. .. .. .. .. ..
## Australia Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados
## 17 303,696 506,088 1,634,081 25,182 12,820 5,451,831 84,931
## 18 .. .. .. .. .. .. ..
## Belarus Belgium Belize Benin Bermuda Bhutan
## 17 1,767,606 365,360 36,117 234,314 71,702 28,465
## 18 .. .. .. .. .. ..
## Bolivia (Plurinational State of) Bonaire, Sint Eustatius and Saba
## 17 224,693 3,206
## 18 .. ..
## Bosnia and Herzegovina Botswana Brazil British Virgin Islands
## 17 861,766 26,053 500,392 3,094
## 18 .. .. .. ..
## Brunei Darussalam Bulgaria Burkina Faso Burundi Cabo Verde Cambodia
## 17 26,323 613,093 1,021,332 337,118 91,368 355,430
## 18 .. .. .. .. .. ..
## Cameroon Canada Cayman Islands Central African Republic Chad
## 17 115,853 998,163 373 46,362 336,802
## 18 .. .. .. .. ..
## Channel Islands Chile China China, Hong Kong SAR China, Macao SAR
## 17 18,726 493,026 4,231,648 551,080 95,648
## 18 .. .. .. .. ..
## Colombia Comoros Congo Cook Islands Costa Rica Côte d'Ivoire Croatia
## 17 1,009,935 40,083 96,372 17,488 69,711 366,348 425,807
## 18 .. .. .. .. .. .. ..
## Cuba Curaçao Cyprus Czechia Dem. People's Republic of Korea
## 17 835,796 43,190 174,378 277,260 39,784
## 18 .. .. .. .. ..
## Democratic Republic of the Congo Denmark Djibouti Dominica
## 17 436,526 201,761 5,308 42,437
## 18 .. .. .. ..
## Dominican Republic Ecuador Egypt El Salvador Equatorial Guinea
## 17 466,216 214,008 1,322,178 1,242,075 36,178
## 18 .. .. .. .. ..
## Eritrea Estonia Eswatini Ethiopia Falkland Islands (Malvinas)
## 17 170,603 113,905 35,181 1,689,955 260
## 18 .. .. .. .. ..
## Faroe Islands Fiji Finland France French Guiana French Polynesia
## 17 7,520 90,166 250,765 1,215,895 2,844 3,149
## 18 .. .. .. .. .. ..
## Gabon Gambia Georgia Germany Ghana Gibraltar Greece Greenland
## 17 15,352 36,280 919,454 2,929,448 371,162 11,994 1,022,459 9,510
## 18 .. .. .. .. .. .. .. ..
## Grenada Guadeloupe Guam Guatemala Guinea Guinea-Bissau Guyana
## 17 43,249 5,828 1,376 348,332 352,763 55,409 233,731
## 18 .. .. .. .. .. .. ..
## Haiti Holy See Honduras Hungary Iceland India Indonesia
## 17 528,873 31 156,594 387,514 17,621 6,623,177 1,638,365
## 18 .. .. .. .. .. .. ..
## Iran (Islamic Republic of) Iraq Ireland Isle of Man Israel
## 17 631,339 1,506,702 917,286 10,735 281,597
## 18 .. .. .. .. ..
## Italy Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait
## 17 3,351,006 589,010 608,921 313,997 2,971,639 250,340 4,053 81,611
## 18 .. .. .. .. .. .. .. ..
## Kyrgyzstan Lao People's Democratic Republic Latvia Lebanon Lesotho
## 17 522,578 483,021 215,134 509,323 191,339
## 18 .. .. .. .. ..
## Liberia Libya Liechtenstein Lithuania Luxembourg Madagascar Malawi
## 17 516,886 76,256 3,428 341,050 36,141 59,424 143,437
## 18 .. .. .. .. .. .. ..
## Malaysia Maldives Mali Malta Marshall Islands Martinique
## 17 562,762 2,193 647,436 110,746 1,426 11,041
## 18 .. .. .. .. .. ..
## Mauritania Mauritius Mayotte Mexico Micronesia (Fed. States of)
## 17 134,488 110,708 1,835 4,395,365 7,714
## 18 .. .. .. .. ..
## Monaco Mongolia Montenegro Montserrat Morocco Mozambique Myanmar
## 17 4,479 24,466 77,384 7,188 1,748,251 2,222,369 685,310
## 18 .. .. .. .. .. .. ..
## Namibia Nauru Nepal Netherlands New Caledonia New Zealand Nicaragua
## 17 16,079 1,419 748,060 723,638 4,151 388,173 442,126
## 18 .. .. .. .. .. .. ..
## Niger Nigeria Niue North Macedonia Northern Mariana Islands Norway
## 17 149,779 446,806 5,860 432,296 2,525 138,536
## 18 .. .. .. .. .. ..
## Oman Pakistan Palau Panama Papua New Guinea Paraguay Peru
## 17 12,535 3,343,328 2,958 134,743 3,111 297,979 314,854
## 18 .. .. .. .. .. .. ..
## Philippines Poland Portugal Puerto Rico Qatar Republic of Korea
## 17 2,033,684 1,510,415 1,873,457 1,200,821 12,204 1,624,729
## 18 .. .. .. .. .. ..
## Republic of Moldova Réunion Romania Russian Federation Rwanda
## 17 625,683 3,087 813,066 12,662,893 547,718
## 18 .. .. .. .. ..
## Saint Helena Saint Kitts and Nevis Saint Lucia
## 17 884 20,714 22,005
## 18 .. .. ..
## Saint Pierre and Miquelon Saint Vincent and the Grenadines Samoa
## 17 485 37,049 74,861
## 18 .. .. ..
## San Marino Sao Tome and Principe Saudi Arabia Senegal Serbia
## 17 1,419 13,977 107,166 370,263 742,547
## 18 .. .. .. .. ..
## Seychelles Sierra Leone Singapore Sint Maarten (Dutch part) Slovakia
## 17 35,633 61,854 156,468 14,823 133,006
## 18 .. .. .. .. ..
## Slovenia Solomon Islands Somalia South Africa South Sudan Spain
## 17 91,496 2,212 848,067 308,303 514,943 1,439,019
## 18 .. .. .. .. .. ..
## Sri Lanka State of Palestine Sudan Suriname Sweden Switzerland
## 17 885,951 1,813,063 584,940 179,870 206,848 326,276
## 18 .. .. .. .. .. ..
## Syrian Arab Republic Tajikistan Thailand Timor-Leste Togo Tokelau
## 17 621,881 537,701 311,308 11,261 193,830 1,684
## 18 .. .. .. .. .. ..
## Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan
## 17 32,666 197,521 465,576 2,640,033 259,987
## 18 .. .. .. .. ..
## Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates
## 17 2,311 2,350 311,602 5,545,760 79,545
## 18 .. .. .. .. ..
## United Kingdom United Republic of Tanzania United States of America
## 17 3,794,333 203,070 1,739,233
## 18 .. .. ..
## United States Virgin Islands Uruguay Uzbekistan Vanuatu
## 17 2,362 237,486 1,428,020 5,060
## 18 .. .. .. ..
## Venezuela (Bolivarian Republic of) Viet Nam Wallis and Futuna Islands
## 17 185,946 1,237,873 6,484
## 18 .. .. ..
## Western Sahara Yemen Zambia Zimbabwe
## 17 168,239 455,492 85,203 204,365
## 18 .. .. .. ..
We are only going to do out analysis on the migrants whose destination is the US, so we filter US using subset. We then remove the US column which is now redundant as this dataset only contains data for US as destination.
We then convert the data to long format.
Although the column numbers which contains migration numbers is stated numeric, the commas in between will cause problems when analysing, so we remove those commas.
migrant1<-subset(migrant, destination=="United States of America") #only filter the US Data
migrant1<-migrant1[,-2] #Removing the US column as this data is all about US as destination
migrant1<-migrant1%>%
gather(key="country", value="numbers", Afghanistan:Zimbabwe) #convert to long format
migrant1<-na.omit(migrant1) #remove the NAs
str(migrant1)
## 'data.frame': 1050 obs. of 3 variables:
## $ year : chr "1990" "1995" "2000" "2005" ...
## $ country: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ numbers: chr "28,444" "37,312" "46,342" "51,965" ...
## - attr(*, "na.action")= 'omit' Named int 22 23 24 25 26 27 28 29 30 31 ...
## ..- attr(*, "names")= chr "22" "23" "24" "25" ...
migrant1$numbers<- as.numeric(gsub(",", "", as.character(migrant1$numbers))) # convert character to numeric so summarization and other analysis may be done.
1)Top ten places where immigrants came from in the last 19 years to the US
We can see that the top three countries where immigrants came from in the last 19 years are Mexico, China and Puerto Rico.
migrant2<-migrant1%>%
group_by(country)%>%
summarize(totalmigrants=sum(numbers))%>%
top_n(10,totalmigrants)%>%
arrange(desc(totalmigrants))
migrant2
## # A tibble: 10 x 2
## country totalmigrants
## <chr> <dbl>
## 1 Mexico 66449196
## 2 China 12287459
## 3 Puerto Rico 10995652
## 4 Philippines 10941865
## 5 India 10471258
## 6 Viet Nam 7344249
## 7 Cuba 7010218
## 8 El Salvador 6925286
## 9 Republic of Korea 6441503
## 10 Canada 5746699
Have migration patterns changed in the last 5 years?
In the last three years, we see that the top 3 countries whose citizens are migrating to the US are Vietnam, Korea and Guatemala accounting for almost 35% of total migration into the US.
migrant3<-migrant1%>%
filter(year>=2014 & year<=2019)%>%
group_by(country)%>%
summarize(totalmigrants=sum(numbers))%>%
mutate(percent=totalmigrants/sum(totalmigrants)*100)%>%
top_n(10,percent)%>%
arrange(desc(percent))
ggplot(migrant3, aes(x = country, y = percent, fill=country )) +
geom_col(position = "dodge") +
ggtitle("US: Top ten countries by immigrants from 2014 to 2019")
2)Which year had the most immigrants moving into the US?
we can see that 2019 had the most number of immigrants into the US, in fact the no of immigrants are increasing every year
migrant4<-migrant1%>%
group_by(year)%>%
summarize(totalmigrants=sum(numbers))%>%
top_n(10,totalmigrants)%>%
arrange(desc(totalmigrants))
ggplot(migrant4, aes(x = year, y =totalmigrants)) +
geom_col(position = "dodge") +
scale_y_discrete("totalmigrants")
Let us plot where the immigrants to US during 1990 to 2019 came from using a map.
world_map <- map_data("world")
names(migrant2)[1]="region"
mig_map <- left_join(world_map, migrant2, by = "region")
ggplot(mig_map, aes(map_id = region, fill = totalmigrants/1000))+
geom_map(map = mig_map, color = "white")+
expand_limits(x = mig_map$long, y = mig_map$lat)+
ggtitle("Total migrants into the US 1990 to 2019")
Let us do the same for the last 5 years
migrant5<-migrant1%>%
filter(year>=2014 & year<=2019)%>%
group_by(country)%>%
summarize(totalmigrants=sum(numbers))
world<- map_data("world")
names(migrant5)[1]="region"
mig_map1 <- left_join(world_map, migrant5, by = "region")
ggplot(mig_map1, aes(map_id = region, fill = totalmigrants/1000))+
geom_map(map = mig_map1, color = "white")+
expand_limits(x = mig_map1$long, y = mig_map1$lat)+
ggtitle("Total migrants into the US 2014 to 2019")
Our analysis shows that even though in the last 19 years overall, migration into the US has been dominated by the people from Mexico and surroundings, China, India and Canada, recently in the last 5 years the pattern has emerged to include more countries from Africa, South America and Asia.
A more proper analysis will include identifying trends for every 5 years.