The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:
The United Nations, Department of Economic and Social Affairs published dataset presents estimates of international migrant by age, sex and origin. Estimates are presented for 1990, 1995, 2000, 2005, 2010, 2015 and 2019 and are available for all countries and areas of the world. The estimates are based on official statistics on the foreign-born or the foreign population.
I decided to get more insight about the data provided by the United Nation for the international migrant stock on the last decade. Some of the questions I want to invistigate is:
I will start by investigating on the total number of migrants by destination and origin country. First, we read the dataset into R.
The files provided by the UN are Excel files, each file contains several tables nested as worksheets. To be able to read a specific table, I downloaded the file, then save as the file as a csv, each table into individual file (the folders are included in GihHub repo). After that I was able to read the file using readr library using read_csv(). I used skip = 16 argument inside read_csv to remove the worksheet header (United nation information, title,…etc).
# read the file into R - use skip to remove
UN_MigrantStock <- read_csv("UN_MigrantStockByOriginAndDestination_2019/Table 1-Table 1.csv",
skip = 16)
UN_MigrantStock
We get in return 1,991 rows with 246 columns - wide table version. this dataset needs tidying, but to do so, we have to clean up first NA, unused columns, from char to dbl types,…etc.
First thing you would notice is that the first 8 rows are not countries, they are regions. This time we want to see how many people are migrating from countries to countries so we can remove these rows for the regions. When you look at ‘X6’ column, it looks that those ‘region’ rows don’t have any value there.
By running the command like below to keep only the rows whose X6 column have NA
# Use filter() from tidyr lib to filter the dataset by only regions
temp <- filter(UN_MigrantStock ,is.na(X6))
head(temp)
These are all regions, not countries, which means that we can safely remove these rows by adding an exclamation mark ‘!’ right before ‘is.na()’ function like below.
# adding ! to is.na() to get countries only
UN_Countries <- filter(UN_MigrantStock ,!is.na(X6))
head(UN_Countries)
## [1] 1624 246
Now we got a dataframe of 1,624 rows with 246 columns as return with all countries only data per year interval.
When you look at the columns we would notice that there are unnecessary columns like ‘Total’, ‘Other South’, etc, because we are interested in estimates of the migrants only for countries to countries. We can remove those unnecessary columns with ‘select’ command along with other unnecessary columns like below. I’m using minus ‘-’ to delete columns and using ‘start_with’ function inside ‘select’ command to delete multiple columns whose names matche the text pattern of “Other”.
UN_Countries_df <- UN_Countries %>%
select(-X2, -X4, -X5, -X6, -Total, -starts_with("Other"))
head(UN_Countries_df)
## [1] 1624 239
We now have 239 columns with 1,624 rows in total as a dataframe with only country to country data.
UN_Countries_rename <- UN_Countries_df %>%
rename(
year = X1,
destination_country = X3
)
head(UN_Countries_rename)
Now, it’s ready to tidy this ‘matrix’-ish data form by using ‘gather’ command from tidyr package. I will gather all origin countries in one column with a corresponding migrants number. Basically, we will gather from column number 4 to column 234 into two columns origin_country and migrants. I used colnames() to get the exact columns number
## [1] "year"
## [2] "destination_country"
## [3] "Afghanistan"
## [4] "Albania"
## [5] "Algeria"
## [6] "American Samoa"
## [7] "Andorra"
## [8] "Angola"
## [9] "Anguilla"
## [10] "Antigua and Barbuda"
## [11] "Argentina"
## [12] "Armenia"
## [13] "Aruba"
## [14] "Australia"
## [15] "Austria"
## [16] "Azerbaijan"
## [17] "Bahamas"
## [18] "Bahrain"
## [19] "Bangladesh"
## [20] "Barbados"
## [21] "Belarus"
## [22] "Belgium"
## [23] "Belize"
## [24] "Benin"
## [25] "Bermuda"
## [26] "Bhutan"
## [27] "Bolivia (Plurinational State of)"
## [28] "Bonaire, Sint Eustatius and Saba"
## [29] "Bosnia and Herzegovina"
## [30] "Botswana"
## [31] "Brazil"
## [32] "British Virgin Islands"
## [33] "Brunei Darussalam"
## [34] "Bulgaria"
## [35] "Burkina Faso"
## [36] "Burundi"
## [37] "Cabo Verde"
## [38] "Cambodia"
## [39] "Cameroon"
## [40] "Canada"
## [41] "Cayman Islands"
## [42] "Central African Republic"
## [43] "Chad"
## [44] "Channel Islands"
## [45] "Chile"
## [46] "China"
## [47] "China, Hong Kong SAR"
## [48] "China, Macao SAR"
## [49] "Colombia"
## [50] "Comoros"
## [51] "Congo"
## [52] "Cook Islands"
## [53] "Costa Rica"
## [54] "Côte d'Ivoire"
## [55] "Croatia"
## [56] "Cuba"
## [57] "Curaçao"
## [58] "Cyprus"
## [59] "Czechia"
## [60] "Dem. People's Republic of Korea"
## [61] "Democratic Republic of the Congo"
## [62] "Denmark"
## [63] "Djibouti"
## [64] "Dominica"
## [65] "Dominican Republic"
## [66] "Ecuador"
## [67] "Egypt"
## [68] "El Salvador"
## [69] "Equatorial Guinea"
## [70] "Eritrea"
## [71] "Estonia"
## [72] "Eswatini"
## [73] "Ethiopia"
## [74] "Falkland Islands (Malvinas)"
## [75] "Faroe Islands"
## [76] "Fiji"
## [77] "Finland"
## [78] "France"
## [79] "French Guiana"
## [80] "French Polynesia"
## [81] "Gabon"
## [82] "Gambia"
## [83] "Georgia"
## [84] "Germany"
## [85] "Ghana"
## [86] "Gibraltar"
## [87] "Greece"
## [88] "Greenland"
## [89] "Grenada"
## [90] "Guadeloupe"
## [91] "Guam"
## [92] "Guatemala"
## [93] "Guinea"
## [94] "Guinea-Bissau"
## [95] "Guyana"
## [96] "Haiti"
## [97] "Holy See"
## [98] "Honduras"
## [99] "Hungary"
## [100] "Iceland"
## [101] "India"
## [102] "Indonesia"
## [103] "Iran (Islamic Republic of)"
## [104] "Iraq"
## [105] "Ireland"
## [106] "Isle of Man"
## [107] "Israel"
## [108] "Italy"
## [109] "Jamaica"
## [110] "Japan"
## [111] "Jordan"
## [112] "Kazakhstan"
## [113] "Kenya"
## [114] "Kiribati"
## [115] "Kuwait"
## [116] "Kyrgyzstan"
## [117] "Lao People's Democratic Republic"
## [118] "Latvia"
## [119] "Lebanon"
## [120] "Lesotho"
## [121] "Liberia"
## [122] "Libya"
## [123] "Liechtenstein"
## [124] "Lithuania"
## [125] "Luxembourg"
## [126] "Madagascar"
## [127] "Malawi"
## [128] "Malaysia"
## [129] "Maldives"
## [130] "Mali"
## [131] "Malta"
## [132] "Marshall Islands"
## [133] "Martinique"
## [134] "Mauritania"
## [135] "Mauritius"
## [136] "Mayotte"
## [137] "Mexico"
## [138] "Micronesia (Fed. States of)"
## [139] "Monaco"
## [140] "Mongolia"
## [141] "Montenegro"
## [142] "Montserrat"
## [143] "Morocco"
## [144] "Mozambique"
## [145] "Myanmar"
## [146] "Namibia"
## [147] "Nauru"
## [148] "Nepal"
## [149] "Netherlands"
## [150] "New Caledonia"
## [151] "New Zealand"
## [152] "Nicaragua"
## [153] "Niger"
## [154] "Nigeria"
## [155] "Niue"
## [156] "North Macedonia"
## [157] "Northern Mariana Islands"
## [158] "Norway"
## [159] "Oman"
## [160] "Pakistan"
## [161] "Palau"
## [162] "Panama"
## [163] "Papua New Guinea"
## [164] "Paraguay"
## [165] "Peru"
## [166] "Philippines"
## [167] "Poland"
## [168] "Portugal"
## [169] "Puerto Rico"
## [170] "Qatar"
## [171] "Republic of Korea"
## [172] "Republic of Moldova"
## [173] "Réunion"
## [174] "Romania"
## [175] "Russian Federation"
## [176] "Rwanda"
## [177] "Saint Helena"
## [178] "Saint Kitts and Nevis"
## [179] "Saint Lucia"
## [180] "Saint Pierre and Miquelon"
## [181] "Saint Vincent and the Grenadines"
## [182] "Samoa"
## [183] "San Marino"
## [184] "Sao Tome and Principe"
## [185] "Saudi Arabia"
## [186] "Senegal"
## [187] "Serbia"
## [188] "Seychelles"
## [189] "Sierra Leone"
## [190] "Singapore"
## [191] "Sint Maarten (Dutch part)"
## [192] "Slovakia"
## [193] "Slovenia"
## [194] "Solomon Islands"
## [195] "Somalia"
## [196] "South Africa"
## [197] "South Sudan"
## [198] "Spain"
## [199] "Sri Lanka"
## [200] "State of Palestine"
## [201] "Sudan"
## [202] "Suriname"
## [203] "Sweden"
## [204] "Switzerland"
## [205] "Syrian Arab Republic"
## [206] "Tajikistan"
## [207] "Thailand"
## [208] "Timor-Leste"
## [209] "Togo"
## [210] "Tokelau"
## [211] "Tonga"
## [212] "Trinidad and Tobago"
## [213] "Tunisia"
## [214] "Turkey"
## [215] "Turkmenistan"
## [216] "Turks and Caicos Islands"
## [217] "Tuvalu"
## [218] "Uganda"
## [219] "Ukraine"
## [220] "United Arab Emirates"
## [221] "United Kingdom"
## [222] "United Republic of Tanzania"
## [223] "United States of America"
## [224] "United States Virgin Islands"
## [225] "Uruguay"
## [226] "Uzbekistan"
## [227] "Vanuatu"
## [228] "Venezuela (Bolivarian Republic of)"
## [229] "Viet Nam"
## [230] "Wallis and Futuna Islands"
## [231] "Western Sahara"
## [232] "Yemen"
## [233] "Zambia"
## [234] "Zimbabwe"
## [235] "X242"
## [236] "X243"
## [237] "X244"
## [238] "X245"
## [239] "X246"
Countries_by_no <- gather(UN_Countries_rename, "origin_country", "migrants", 3:234, na.rm = TRUE)
head(Countries_by_no)
# remove extra columns generated
clean_country_df <- Countries_by_no[, -c(3:7)]
dim(clean_country_df)
## [1] 78754 4
Now we condensed our data into long version - only 4 columns. However, migrants column is a character type, we need to convert it into double type. I used the parse_number() method inside the readr package to do the job for me.
Here we want to know which countries has the highest number of migrant by year. To do so, I would use group_by from tidyr then get the summation of all migrants per destination country.
# change the migrants variable into num type
num_country_df <- clean_country_df
num_country_df$migrants <- parse_number(num_country_df$migrants)
str(num_country_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 78754 obs. of 4 variables:
## $ year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ destination_country: chr "Namibia" "South Africa" "Egypt" "Libya" ...
## $ origin_country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ migrants : num 64 59 237 677 254 ...
## ..- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 345 obs. of 4 variables:
## .. ..$ row : int 496 546 597 648 1397 1419 1589 1644 1938 1956 ...
## .. ..$ col : int NA NA NA NA NA NA NA NA NA NA ...
## .. ..$ expected: chr "a number" "a number" "a number" "a number" ...
## .. ..$ actual : chr "-" "-" "-" "-" ...
# group_by year, then sum the migrants per destination country
by_dest_Country_df <- num_country_df %>%
group_by(year, destination_country, origin_country) %>%
summarise(total_migrants = sum(migrants)) %>%
# finally arrange them from big to small
arrange(desc(total_migrants))
by_dest_Country_df ## this is the dataframe that will be exported
As we can see here, we got more than 78,700 colums. I would prefer to take a subset of the data based on some statistical analysis. To do so, we can get mean, median to set a filtering criteria. Additionally, I will add a ranking column to the dataframe to rank the countries
# first get mean, median, max, min
summary(by_dest_Country_df$total_migrants) # 17319 -> take 20,000
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 39 280 17319 2342 12168662 345
# take a subset by filtering the results to only pick total migrants > mean
sub_set_migrant <- select(filter(by_dest_Country_df, total_migrants > 20000), c(1:4))
# add ranking system desc. ordered
sub_set_migrant$Rank <- order(sub_set_migrant$total_migrants, decreasing = TRUE)
sub_set_migrant
Now we get a much tider data about 7,000 columns, only 4 columns. However, we can use spread() function to investigate the number of migrants for each year interval.
gather_by_year <- sub_set_migrant %>%
spread(key = "year", value = "total_migrants")%>%
arrange(Rank)
# replace NA with 0 value
gather_by_year[, 4:10][is.na(gather_by_year[, 4:10])] <- 0
gather_by_year
Please be sure that you set working directory in Rstudio to the current working directory.
## [1] TRUE
dest_origin_df <- read.csv(file = "migrantByOriginDestination.csv", header = TRUE, sep = ",")
dest_origin_df
We can open the file from the GitHub
url <- 'https://raw.githubusercontent.com/salma71/MSDS_2019/master/Fall2019/aquisition_management_607/week_6/project_2/migrantByOriginDestination.csv'
# reading the url as a dataframe
by_origin_destination <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
head(by_origin_destination)
I would prefer to explore the data quickly using the world map. I used ggmap r package to draw world map, then I recoded the country name in the world map - only for the United States of America to be US and United Kingdom to be UK. Finally, I did a left join by region = country, by this, I was able to cast world map to fit my data.
map_world <- map_data("world")
#map_world$region
subset_migrant <- sub_set_migrant
#recode the country name to be able to left join the two data frames
head(as.factor(subset_migrant$destination_country) %>%
levels())
## [1] "Afghanistan" "Albania" "Algeria" "Andorra" "Angola"
## [6] "Argentina"
# RECODE NAMES
subset_migrant$destination_country <- recode(subset_migrant$destination_country
,'United States of America' = 'USA'
,'United Kingdom' = 'UK'
)
#rename the column to be the same as world_map, so destination_country becomes country
colnames(subset_migrant)[2] <- "country"
#View(new_dat)
# LEFT JOIN -- we join region from map into country in subset_migrant.
map_world_joined <- left_join(map_world, subset_migrant, by = c('region' = 'country'))
mapp <- map_world_joined %>%
mutate(fill_flg = ifelse(is.na(year),F,T)) %>%
filter(year == "2019")
ggplot() +
geom_polygon(data = mapp, aes(x = long, y = lat, group = group, fill = total_migrants)) +
labs(title = 'Countries with highest "Migrant Population in 2019"') +
geom_point() +
theme(text = element_text(family = "Gill Sans")
,plot.title = element_text(size = 15)
,plot.subtitle = element_text(size = 10)
,legend.text = element_text(size = 10)
)
As we can see from the plot, The top destination countries that migrant targeting is the United states of America in 2019 with more than 9M migrants. However, we need to get more insights into the last three year intervals, 2010, 2015, 2019.
# filtering the number of migrants that is more than 3M in 2010, 2015, and 2019
migration_15_19 = sub_set_migrant %>%
filter(total_migrants >= 3000000 & year %in% c(2010 ,2015, 2019)) %>%
select(destination_country, origin_country, year, total_migrants)%>%
arrange(desc(total_migrants))
migration_15_19
ggplot(migration_15_19, aes(x=destination_country, y=total_migrants, fill = destination_country)) +
geom_bar(stat="identity", width=1, position = position_dodge()) +
labs(fill="destination country") +
labs(title="Migrant population bar chart",
subtitle="Top destinations for migrats population - more than 3M migrants per year") +
theme_bw() +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
facet_grid( ~ year)
We got some interesting results here, for instance, United states of America has the highest migrants population over the three time intervals. However, the total migrants that USA accepts decreased by 1 million migrants in 2019. Turkey started accepting migrants in 2019 time interval with a significant number more than 3.5M migrants per interval. As an overall observation, more countries accepting migrants throught time where 2019 interval has the the top 6 countries accepting population.
For the next steps, I am interested in investigating what are the migrants’ origin countries that migrate to the United States and Turkey.
by_origin_country <- migration_15_19 %>%
filter(destination_country %in% c("United States of America", "United Arab Emirates", "Ukraine", "Turkey", "Russian Federation", "India") & total_migrants >= 3000000) %>%
select(destination_country, origin_country, total_migrants, year)
by_origin_country
## Named list()
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
ggplot(by_origin_country, aes(x = destination_country , y = total_migrants, colour = origin_country)) +
geom_point(size = 3) +
facet_grid(~ by_origin_country$year) +
labs(y = "total_migrants") +
scale_y_continuous() +
theme(axis.text.x = element_text(angle=65, vjust=0.65)) +
theme_linedraw()
The majority of migrant population to the United States of America is from Mexico. Also, it seams that the Turkey accepted more than 3.7M migrant from Syrian Arab Republic in 2019 interval. The other main origin countries that have most of the population are India, Ukrain, and Russian Federation respectively with no major difference in the total population.
only_3 <- migration_15_19%>%
filter(destination_country %in% c("United States of America", "United Arab Emirates", "Turkey") & year == "2019") %>%
select(destination_country, origin_country, total_migrants)
only_3
As illustrated, we can concluded that United States is the top destination for Mexicans. However, Indians like to live in the United Arab Emirates at the same time Syrians would rather to flee to Turkey
UN_Migrant_age <- read_csv("UN_MigrantStockByAgeAndSex_2019/Table 1-Table 1.csv", skip = 15)
UN_Migrant_age
As the previous dataset, we need to extract only the countries not regions. So, we will do the same steps as before
We now have 1,624 row with 62 columns. Now it is time to do some cleaning up.
migrant_age_df <- UN_Countries_age %>%
select(-X2, -X4, -X5, -X6, -starts_with("Total"), -c(58:62))
migrant_age_df
Now we have 1624 rows with a 34 columns
Since we are focusing here on what is the most age categories that had migrated from their origin countries to another countries. So I decided to subset the data into three subsets, both categories, males only, and females only. We are interested here in both categories.
#Use select function to fetch only the data that we interested by picking the colnames
both_sex_df <- migrant_age_df %>%
select(c("X1":"75+"))
both_sex_df
Now we have 18 columns, we need to convert the column type from char to num to be able to complete the summation. Second, use gather() make the data set into long format.
num_age_df <- both_sex_rename
num_age_df$`0-4` <- parse_number(num_age_df$`0-4`)
num_age_df$`5-9` <- parse_number(num_age_df$`5-9`)
num_age_df$`10-14` <- parse_number(num_age_df$`10-14`)
num_age_df$`15-19` <- parse_number(num_age_df$`15-19`)
num_age_df$`20-24` <- parse_number(num_age_df$`20-24`)
num_age_df$`25-29` <- parse_number(num_age_df$`25-29`)
num_age_df$`30-34` <- parse_number(num_age_df$`30-34`)
num_age_df$`35-39` <- parse_number(num_age_df$`35-39`)
num_age_df$`40-44` <- parse_number(num_age_df$`40-44`)
num_age_df$`45-49` <- parse_number(num_age_df$`45-49`)
num_age_df$`50-54` <- parse_number(num_age_df$`50-54`)
num_age_df$`55-59` <- parse_number(num_age_df$`55-59`)
num_age_df$`60-64` <- parse_number(num_age_df$`60-64`)
num_age_df$`65-69` <- parse_number(num_age_df$`65-69`)
num_age_df$`70-74` <- parse_number(num_age_df$`70-74`)
num_age_df$`75+` <- parse_number(num_age_df$`75+`)
num_age_df
# gather total population by age category
by_age <- num_age_df %>%
gather(age, total_population, c(3:18))
by_age
Please be sure that you set working directory in Rstudio to the current working directory.
## [1] TRUE
We can open the file from the GitHub
url <- 'https://raw.githubusercontent.com/salma71/MSDS_2019/master/Fall2019/aquisition_management_607/week_6/project_2/migrantByAge.csv'
# reading the url as a dataframe
by_age_bothsex <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
head(by_age_bothsex)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 1227 6068 55709 30883 5485048 480
by_youth <- by_age%>%
filter(age %in% c("20-24", "25-29", "30-34", "35-39") & total_population >= 55709)%>%
select(year, destination_country, total_population)%>%
arrange(desc(total_population))
by_youth
theme_set(theme_classic())
ggplot(by_youth, aes(x=year, y=total_population)) +
geom_point(col="tomato2", size=3) + # Draw points
geom_segment(aes(x=year,
xend=year,
y=min(total_population),
yend=max(total_population)),
linetype="dashed",
size=0.1) + # Draw dashed lines
labs(title="Dot Plot",
subtitle="Youth population over the years",
caption="source: by_youth") +
coord_flip()
#check how many children
by_child <- by_age%>%
filter(age %in% c("0-4", "5-9", "10-14", "15-19") & total_population >= 55709)%>%
select(year, destination_country, total_population, age)%>%
arrange(desc(total_population))
by_child
#check how many adults
by_adult <- by_age%>%
filter(age %in% c("40-44", "45-49", "50-54", "55-59") & total_population >= 55709)%>%
select(year, destination_country, total_population, age)%>%
arrange(desc(total_population))
by_adult
#check how many senior
by_senior <- by_age%>%
filter(age %in% c("60-64", "65-69", "70-74", "75+") & total_population >= 55709)%>%
select(year, destination_country, total_population, age)%>%
arrange(desc(total_population))
by_senior
I used dplyr::leftjoin -> Reduce function along with merge to combine all 4 dataframes into one.
merged_age <- Reduce(function(...) merge(..., all=TRUE, sort = FALSE), list(by_child, by_youth, by_adult, by_senior))
merged_age
Now based on the previous analysis in Dataset_1, we will pick only the top countries that have majority of population to get more insights how many migrants for each age category. I will look into those the top 6 destinations which are USA, Turkey, United Arab Emirates, India, Ukraine, and Russian Federation.
First, we need to condense the age groups into categories to be easy to visualize. To do that, we can construct a for-loop to iterate over the merged_age dataframe and change the value into the category we want child, youth, adult, and senior migrants as the table below:
child | 0 - 19
youth | 20 - 39
adult | 40 - 59
senior | 60 - 75+
for (i in 1:nrow(merged_age)) {
if(merged_age[[i,4]] %in% c("0-4" , "5-9" , "10-14" ,"15-19")) {
merged_age[[i,4]] <- "child"
} else if(merged_age[[i,4]] %in% c("20-24", "25-29" ,"30-34" ,"35-39")) {
merged_age[[i,4]] <- "youth"
} else if(merged_age[[i,4]] %in% c("40-44" ,"45-49" ,"50-54" ,"55-59")) {
merged_age[[i,4]] <- "adult"
} else if(merged_age[[i,4]] %in% c("60-64", "65-69", "70-74", "75+")) {
merged_age[[i,4]] <- "senior"
}
merged_age
}
merged_age
The dataframe now is categorized into 4 age categories; however, the dataframe has NA values need to be replaced by zero.
#replace NA <- 0
merged_age[is.na(merged_age)] <- 0
# filter the results to get only the non-zero values, 3 years intervals, and 5 top countries desc. arranged.
top_migrant_countries <- merged_age %>%
filter(destination_country %in% c("United States of America", "United Arab Emirates", "Ukraine", "Turkey", "Russian Federation", "India") & year %in% c(2010, 2015,2019) & age != 0) %>%
select(destination_country, age, total_population, year)%>%
arrange(desc(total_population))
top_migrant_countries
ggplot(top_migrant_countries, aes(x=age, y=total_population, fill = destination_country)) +
geom_bar(stat="identity",position=position_dodge()) +
labs(title="Migrant age-categories during 3 intervals",
subtitle="Top migrats age-categories for the top 6 destination countries") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
facet_wrap(~year) +
theme_light()
Here, I wanted to get the age distribution for the best 3 countries to follow the pattern in the first dataset. So I filtered the destination_country and year to get the desired data.
best_3 <- merged_age%>%
filter(destination_country %in% c("United States of America", "United Arab Emirates", "Turkey") & year %in% c("2019") & age != 0) %>%
select(destination_country, age, total_population, year)%>%
arrange(desc(total_population))
best_3
ggplot(best_3, aes(x=age, y=total_population, group = age)) +
geom_col(aes(fill = age), position = "dodge", stat = "identity") +
labs(title="Migrant age-categories distribution in 2019",
subtitle="Top migrats age-categories for the top 3 destination countries") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
facet_wrap(~ destination_country) +
theme_light()
library(readr)
UN_Refugee_df <- read_csv("UN_MigrantStockTotal_2019/Table 6-Table 1.csv",
skip = 15)
UN_Refugee_df
We got 284 rows with 25 columns. + get only countries not regions
## [1] 232 25
## [1] 232 21
Now we have 232 rows with a 21 columns
#Use select function to fetch only the data that we interested by picking the colnames - both males, and females
estimate_refugee_df <- refugee_df %>%
select(c("X2":"2019"))
dim(estimate_refugee_df)
## [1] 232 8
estimate_refugee_rename <- estimate_refugee_df %>%
rename(
destination_country = X2,
)
estimate_refugee_rename
Pasring years columns into doubles using parse_number() from readr package
num_est_refugee <- estimate_refugee_rename
num_est_refugee$`1990` <- parse_number(num_est_refugee$`1990`)
num_est_refugee$`1995` <- parse_number(num_est_refugee$`1995`)
num_est_refugee$`2000` <- parse_number(num_est_refugee$`2000`)
num_est_refugee$`2005` <- parse_number(num_est_refugee$`2005`)
num_est_refugee$`2010` <- parse_number(num_est_refugee$`2010`)
num_est_refugee$`2015` <- parse_number(num_est_refugee$`2015`)
num_est_refugee$`2019` <- parse_number(num_est_refugee$`2019`)
num_est_refugee
Make the data more tidy by gather total_estimates by year in 2 columns instead of 7 columns - long format.
# gather() by year into total_estimates - gather columns 2 (1990) : 8(2019) then arrange desc
by_refugee_pop <- num_est_refugee%>%
gather(year, total_estimates, c(2:8)) %>%
arrange(desc(total_estimates))
dim(by_refugee_pop)
## [1] 1624 3
Please be sure that you set working directory in Rstudio to the current working directory.
## [1] TRUE
refugeeEstimates <- read.csv(file = "refugeeEstimates.csv", header = TRUE, sep = ",")
refugeeEstimates
We can open the file from the GitHub
url <- 'https://raw.githubusercontent.com/salma71/MSDS_2019/master/Fall2019/aquisition_management_607/week_6/project_2/refugeeEstimates.csv'
# reading the url as a dataframe
ref_estimates <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
head(ref_estimates)
To follow the same pattern of analysis, I will filter the destination countries to get the top 4 countries in the last 3 year intervals.
# filter by year, destination, total estimates is not equal to 0
by_summary <- by_refugee_pop%>%
filter(destination_country %in% c("United States of America", "Turkey", "United Arab Emirates" ) & year %in% c(2010, 2015,2019) & total_estimates != 0)
by_summary
ggplot(by_summary, aes(x = destination_country, y = total_estimates, fill = destination_country)) +
geom_bar(stat = "identity",
width = 1) +
geom_text(aes(label = total_estimates),
hjust = .5,
color = "black",
size = 3.5) +
labs(title="Refugee estimates for top 3 destinations",
subtitle="How much refugees the top three destinations can accept per year interval") +
theme_bw() +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
facet_wrap(~ year)
From 2010 to 2019, a significant increase of refugess was estimated in Turkey. Specifically, the refugess estimates increased from 10,000 to 3.7 million, which was the largest number of refugees ever predicted in all countries. Similarly, but with a much lower rate, the number of refugees in the United States increased from 270,000 to 929,762. In contrast, the estimated refugees in the United Arab Emirates was the smallest (<3000).
The analysis gave a basic idea on average migration populations on different areas in the world and how many more migrants these destinations would accept. I would like to extend the study to investigate what factors influence migrants to leave their country and target those destinations, it it life quality, employment opportunities,..etc. Also, I am curious to know what level of education migrant have are they high educational migrant, high skilled technicians,..etc.