Data Source: “Talent Migration Data” by World Bank Group & LinkedIn Corporation
Questions:
For the purpose of this report I am displaying the setup code below. In my setup code I am:
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(dplyr)
library(tidyverse)
industry <- read_csv("industry_migration_public.csv")
Step 1: I am going to format my data to make it usable.
First I am going to select only a subset of variables that I am interested in including in my analysis.
To do this I am going to start off by looking at all the variables in the industries data set:
print(names(industry))
## [1] "country_code" "country_name" "wb_income"
## [4] "wb_region" "isic_section_index" "isic_section_name"
## [7] "industry_id" "industry_name" "net_per_10K_2015"
## [10] "net_per_10K_2016" "net_per_10K_2017" "net_per_10K_2018"
## [13] "net_per_10K_2019" "...14" "...15"
## [16] "...16" "...17" "...18"
## [19] "...19" "...20" "...21"
## [22] "...22" "...23" "...24"
## [25] "...25" "...26" "...27"
## [28] "...28" "...29" "...30"
I am going to create a new data frame ‘industry_analysis’ in which I only select the variables that I want to include in my analysis
industry_analysis <- data.frame(industry$country_name, industry$wb_region,industry$isic_section_name,industry$industry_name, industry$net_per_10K_2015, industry$net_per_10K_2016, industry$net_per_10K_2017, industry$net_per_10K_2018, industry$net_per_10K_2019)
Next I am going to rename the variables.
names(industry_analysis) <- c("country_name","region","sector_name","industry_name","net2015","net2016","net2017","net2018","net2019")
Finally I am going to add a variable to my new data frame in which I calculate the total net migration for periods 2015 to 2019. I will call this variable ‘net_total’
industry_analysis$net_total <- industry_analysis$net2015 + industry_analysis$net2016 + industry_analysis$net2017 + industry_analysis$net2018 + industry_analysis$net2019
Now I am going to look at the structure of my data set by displaying the first 10 rows of data in a table.
knitr::kable(
head(industry_analysis, n=10),
label = "First 10 rows of data in my industries analysis data frame"
)
| country_name | region | sector_name | industry_name | net2015 | net2016 | net2017 | net2018 | net2019 | net_total |
|---|---|---|---|---|---|---|---|---|---|
| United Arab Emirates | Middle East & North Africa | Manufacturing | Defense & Space | 378.74 | 127.94 | 8.20 | 68.51 | 49.55 | 632.94 |
| United Arab Emirates | Middle East & North Africa | Information and communication | Computer Hardware | 100.97 | 358.14 | 112.98 | 149.57 | 182.22 | 903.88 |
| United Arab Emirates | Middle East & North Africa | Information and communication | Computer Software | 1079.36 | 848.15 | 596.48 | 409.18 | 407.41 | 3340.58 |
| United Arab Emirates | Middle East & North Africa | Information and communication | Computer Networking | 401.46 | 447.39 | 163.99 | 236.69 | 188.07 | 1437.60 |
| United Arab Emirates | Middle East & North Africa | Information and communication | Internet | 1840.33 | 1368.42 | 877.71 | 852.39 | 519.40 | 5458.25 |
| United Arab Emirates | Middle East & North Africa | Information and communication | Telecommunications | 676.91 | 676.11 | 365.96 | 283.59 | 281.91 | 2284.48 |
| United Arab Emirates | Middle East & North Africa | Professional scientific and technical activities | Law Practice | 694.50 | 529.50 | 362.83 | 356.56 | 147.83 | 2091.22 |
| United Arab Emirates | Middle East & North Africa | Professional scientific and technical activities | Legal Services | 661.10 | 665.65 | 462.30 | 475.62 | 255.21 | 2519.88 |
| United Arab Emirates | Middle East & North Africa | Professional scientific and technical activities | Management Consulting | 982.85 | 694.25 | 520.28 | 574.60 | 406.03 | 3178.01 |
| United Arab Emirates | Middle East & North Africa | Professional scientific and technical activities | Biotechnology | 1230.17 | 846.28 | 500.77 | 205.78 | 513.23 | 3296.23 |
Step 2: Data Analysis
Now that I have the data in a format that is usable I can start with my data analysis to answer my questions.
Which industries experienced the biggest net migration in talent for the periods from 2015 to 2019?
industry_rank <- industry_analysis %>% group_by(industry_name) %>% summarise(sum_net_total = sum(net_total,na.rm = TRUE), .groups = "drop")
industry_rank_ordered <- industry_rank[with(industry_rank, order(-industry_rank$sum_net_total)),]
knitr::kable(
head(industry_rank_ordered,n=5),
label = "Top 5 industries with highest net migration from 2015 - 2019"
)
| industry_name | sum_net_total |
|---|---|
| Internet | 46852.93 |
| International Trade & Development | 30233.38 |
| Renewables & Environment | 29982.72 |
| Outsourcing/Offshoring | 29527.62 |
| E-Learning | 29067.23 |
Which region experienced the biggest net increase in talent migration in the Information and communication sector for the periods from 2015 to 2019?
industry_region_df <- subset(industry_analysis,sector_name == "Information and communication")
region_it_rank <- industry_region_df %>% group_by(region, sector_name) %>% summarise (sum_net_total = sum(net_total, na.rm = TRUE), .groups = "drop")
region_it_rank_ordered <- region_it_rank[with(region_it_rank, order(-region_it_rank$sum_net_total)),]
knitr::kable(
region_it_rank_ordered,
label = "Regions with most to least talent migration in Information & Communications Sector"
)
| region | sector_name | sum_net_total |
|---|---|---|
| Europe & Central Asia | Information and communication | 105249.34 |
| East Asia & Pacific | Information and communication | 56354.44 |
| Middle East & North Africa | Information and communication | 27855.85 |
| North America | Information and communication | 17922.41 |
| Sub-Saharan Africa | Information and communication | 3454.66 |
| South Asia | Information and communication | -20810.75 |
| Latin America & Caribbean | Information and communication | -28691.58 |
As an additional step I also want to know how many countries are included in each region and I want to add that count as a variable to my results.
region_country <- as.data.frame(unique(industry_region_df[,1:2]))
region_country_count <- region_country %>% group_by(region) %>% count(region)
result_df <- merge(region_it_rank_ordered,region_country_count,by="region")
result_df <- result_df[with(result_df,order(-result_df$sum_net_total)),]
names(result_df)[4] <- "num_countries"
knitr::kable(
result_df,
label = "Regions (including country count) with most to least talent migration in Information & Communications"
)
| region | sector_name | sum_net_total | num_countries | |
|---|---|---|---|---|
| 2 | Europe & Central Asia | Information and communication | 105249.34 | 42 |
| 1 | East Asia & Pacific | Information and communication | 56354.44 | 15 |
| 4 | Middle East & North Africa | Information and communication | 27855.85 | 18 |
| 5 | North America | Information and communication | 17922.41 | 2 |
| 7 | Sub-Saharan Africa | Information and communication | 3454.66 | 18 |
| 6 | South Asia | Information and communication | -20810.75 | 6 |
| 3 | Latin America & Caribbean | Information and communication | -28691.58 | 18 |
Step 3: Charting
Lastly I want to chart the top 20 and bottom 20 industries to visually display the results:
t20_industries <- head(industry_rank_ordered,n=20)
t20_industries %>% mutate(industry_name = fct_reorder(industry_name,sum_net_total)) %>%
ggplot( aes(industry_name,sum_net_total)) + geom_bar(stat="identity", fill="#44AD50") + coord_flip() + xlab("Industry") + ylab("Net Movement (per 50k) 2015 - 2019")
b20_industries <- tail(industry_rank_ordered,n=20)
b20_industries %>% mutate(industry_name = fct_reorder(industry_name,sum_net_total)) %>%
ggplot( aes(industry_name,sum_net_total)) + geom_bar(stat="identity", fill="#446EAD") + coord_flip() + xlab("Industry") + ylab("Net Movement (per 50k) 2015 - 2019")