Load packages
library(readr) # Useful for importing data
library(readxl)
library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
library(gdata)
library(rvest) # Useful for scraping HTML data
library(dplyr)
library(tidyr)
library(deductive)
library(deducorrect)
library(editrules)
library(validate)
library(Hmisc)
library(forecast)
library(stringr)
library(lubridate)
library(car)
library(outliers)
library(MVN)
library(infotheo)
library(MASS)
library(caret)
library(mlr)
library(ggplot2)
library(knitr) # Useful for creating nice tables
In economics, the Gini coefficient (/ˈdʒiːni/ JEE-nee), sometimes called the Gini index or Gini ratio, is a measure of statistical dispersion intended to represent the income or wealth distribution of a nation’s residents, and is the most commonly used measurement of inequality.
I decided to analyse what level of inequality exists in the family immigration stream for Australia. I obtained the immigration data from data.gov.au and the gini coefficient data from World bank.
Firstly, I joined the gini data with the gini metadata to form a gini dataset and I filtered it for the last 10 years. I then converted it to a long format, immigration data was also converted to long format to enable join. I adjusted the country names as they were inconsistent across the datasets.I imputed any missing values with a group average and analysed the immigration counts against the gini coefficients for different countries in the top 15 for family immigration stream.
The datasets are open source and have been downloaded from teh following locations.
1. Australian Migration Statistics, 2009-2019 from data.gov.au
https://data.gov.au/dataset/ds-dga-dba45e7c-81f4-44aa-9d82-1b9a0a121017/distribution/dist-dga-afbf44d6-080b-48d1-bffb-093887ce9a65/details?q=
This data is in excel format and contains data in several worksheets. For this analysis, I chose to analyse “All other Family stream outcome and Child outcome visas—top 15 citizenship countries, Parent, Other Family and Child visas, 2009–10 to 2018–19” in worksheet “1.10”. Data is in a cross tab format, with Countries as column headers and yearwise migration numbers in rows.
2. GINI index (World Bank estimate)
https://data.worldbank.org/indicator/SI.POV.GINI
The data is downlaoded as a zip file, which contains two csv files inside, one with gini index for most countries in the world, from 1960 to 2018. A separate CSV file provides additional information for the countries, like region and income group.
The aim of the this analysis is to combine the GINI index data with the family stream migration data for the top 15 countries in Australia for 2009-2019, to understand, if GINI index is correlated with family migration.
Set working directory
#setwd('/Users/Sid/Desktop/RMIT/MATH2349/ass2/data') #set working directory to the path where datafile was saved
immigration_all_file <- "migration_trends_statistical_package_2018_19.xlsx"
gini_file <- "API_SI.POV.GINI_DS2_en_csv_v2_1120931.csv"
gini_metadata_file <- "Metadata_Country_API_SI.POV.GINI_DS2_en_csv_v2_1120931.csv"
immigration_all <- read_excel(immigration_all_file, range = "1.10!B6:S16")
head(immigration_all)
Read gini data
gini <- read_csv(gini_file, skip =4) # Gini Index data - skip 4 rows as these are not required
Missing column names filled in: 'X65' [65]Parsed with column specification:
cols(
.default = col_double(),
`Country Name` = [31mcol_character()[39m,
`Country Code` = [31mcol_character()[39m,
`Indicator Name` = [31mcol_character()[39m,
`Indicator Code` = [31mcol_character()[39m,
`1960` = [33mcol_logical()[39m,
`1961` = [33mcol_logical()[39m,
`1962` = [33mcol_logical()[39m,
`1963` = [33mcol_logical()[39m,
`1964` = [33mcol_logical()[39m,
`1965` = [33mcol_logical()[39m,
`1966` = [33mcol_logical()[39m,
`1968` = [33mcol_logical()[39m,
`1970` = [33mcol_logical()[39m,
`1972` = [33mcol_logical()[39m,
`1973` = [33mcol_logical()[39m,
`1976` = [33mcol_logical()[39m,
`1977` = [33mcol_logical()[39m,
`2019` = [33mcol_logical()[39m,
X65 = [33mcol_logical()[39m
)
See spec(...) for full column specifications.
head(gini)
Read gini metadata
gini_meta <- read_csv(gini_metadata_file) # Gini metadata file providing information about the countries
Missing column names filled in: 'X6' [6]Parsed with column specification:
cols(
`Country Code` = [31mcol_character()[39m,
Region = [31mcol_character()[39m,
IncomeGroup = [31mcol_character()[39m,
SpecialNotes = [31mcol_character()[39m,
TableName = [31mcol_character()[39m,
X6 = [33mcol_logical()[39m
)
head(gini_meta)
Observations for the datasets.
1. Immigration data - UNTIDY, because columns are not variables, rather they are values for a country variable.
10 observations of 18 variables
2. gini index dataset has lots of missing data.
3. gini metadata dataset also has some missing data.
I will first convert the year columns for immigration data by splitting the column and then convert to a tidy dataset using the “separate” function in tidyr. I will also drop “New Zealand2”,“Other3”,“Total” columns as they are either summary columns or have missing data.
#Change the country names to make it consistent with both datasets.
names(immigration_all)[names(immigration_all)=="People's Republic of China"] <- "China"
names(immigration_all)[names(immigration_all)=="Hong Kong1"] <- "Hong Kong SAR, China"
names(immigration_all)[names(immigration_all)=="United States of America"] <- "United States"
names(immigration_all)[names(immigration_all)=="Iran"] <- "Iran, Islamic Rep."
immigration_all <- tidyr::separate(immigration_all,Year, c("Year", "Discard"))
immigration_all <- immigration_all[ , -which(names(immigration_all) %in% c("New Zealand2","Other3","Total","Discard"))]
immigration_subset <- gather(data =immigration_all, key = "country_name", "immi_count",2:15)
Immigration subset is now in long format with 140 observations across 3 variables
head(immigration_subset)
Firstly, the datasets will be subset to choose important variables and then joined.
#change variable names to remove spaces and to enable easy join
names(gini)[names(gini)=="Country Code"] <- "country_code"
names(gini)[names(gini)=="Country Name"] <- "country_name"
names(gini_meta)[names(gini_meta)=="Country Code"] <- "country_code"
gini_all <- gini %>% left_join(gini_meta, by = "country_code") # Use a left join to select metadata for all available countries
Subset the gini dataset to chose the variables of interest. I’m only interest in years 2009 - 2018. I laso choose region, income group and country information
colnames <- c("country_name","country_code","2009","2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","Region","IncomeGroup")
gini_all <- gini_all[colnames]
Summary of the joined gini dataset
summary(gini_all)
country_name country_code 2009 2010 2011 2012 2013 2014
Length:264 Length:264 Min. :24.80 Min. :24.80 Min. :24.60 Min. :24.70 Min. :24.60 Min. :24.00
Class :character Class :character 1st Qu.:31.20 1st Qu.:30.40 1st Qu.:30.23 1st Qu.:29.55 1st Qu.:30.77 1st Qu.:31.12
Mode :character Mode :character Median :35.40 Median :34.10 Median :34.15 Median :35.60 Median :36.20 Median :35.15
Mean :37.58 Mean :36.54 Mean :36.20 Mean :36.45 Mean :36.54 Mean :36.65
3rd Qu.:44.10 3rd Qu.:41.35 3rd Qu.:42.33 3rd Qu.:41.25 3rd Qu.:40.92 3rd Qu.:40.90
Max. :61.00 Max. :63.40 Max. :56.20 Max. :56.10 Max. :52.80 Max. :63.00
NA's :187 NA's :181 NA's :188 NA's :185 NA's :188 NA's :184
2015 2016 2017 2018 2019 Region IncomeGroup
Min. :25.40 Min. :24.80 Min. :24.20 Min. :25.20 Mode:logical Length:264 Length:264
1st Qu.:31.55 1st Qu.:31.12 1st Qu.:29.88 1st Qu.:35.70 NA's:264 Class :character Class :character
Median :35.75 Median :35.10 Median :35.75 Median :40.55 Mode :character Mode :character
Mean :36.81 Mean :36.38 Mean :36.17 Mean :40.17
3rd Qu.:41.12 3rd Qu.:41.52 3rd Qu.:41.10 3rd Qu.:45.60
Max. :59.10 Max. :54.60 Max. :56.30 Max. :53.90
NA's :184 NA's :188 NA's :198 NA's :236
As can be seen gini data is missing for most countries and years. These will be imputed later.
#Convert factor columns to factor
factor_cols <- c("country_code","Region")
gini_all[factor_cols] <- lapply(gini_all[factor_cols] , factor)
gini_all$IncomeGroup <- factor(gini_all$IncomeGroup, levels = c("High income","Upper middle income","Lower middle income", "Low income")) # Ordered factor
Convert dataset to long format.
#gini_all is also an untidy dataset as year values are in column headers. Convert to long format to enable join.
gini_long <- gather(data =gini_all, key = "Year", "gini",3:13)
#Join immigration data to gini data
immi_gini <- left_join(immigration_subset, gini_long)
Joining, by = c("Year", "country_name")
factor_cols <- c("Year")
immi_gini[factor_cols] <- lapply(immi_gini[factor_cols] , factor)
The joined dataset is as follows.
head(immi_gini)
Symmary of the processed dataset.
summary(immi_gini)
Year country_name immi_count country_code Region IncomeGroup gini
2009 :14 Length:140 Min. : 54.0 CHN :10 East Asia & Pacific :70 High income :30 Min. :29.80
2010 :14 Class :character 1st Qu.: 170.8 GBR :10 Europe & Central Asia :10 Upper middle income:60 1st Qu.:36.10
2011 :14 Mode :character Median : 280.0 HKG :10 Latin America & Caribbean : 0 Lower middle income:50 Median :39.10
2012 :14 Mean : 723.6 IDN :10 Middle East & North Africa:10 Low income : 0 Mean :39.03
2013 :14 3rd Qu.: 717.8 IND :10 North America :10 3rd Qu.:41.00
2014 :14 Max. :5561.0 IRN :10 South Asia :30 Max. :63.40
(Other):56 (Other):80 Sub-Saharan Africa :10 NA's :74
As can be seen , I now have a dataframe with four factor variables (Year, Country_code, Region,IncomeGroup), two numeric(Immi_count, gini) and one character(Country_name) variable.
Firstly, I check for any special values in the numeric variables.
# Check Immi_count variable for special values
is.finite(immi_gini$immi_count)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[28] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[55] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[82] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[109] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[136] TRUE TRUE TRUE TRUE TRUE
is.infinite(immi_gini$immi_count)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[47] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[70] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[93] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[116] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[139] FALSE FALSE
is.nan(immi_gini$immi_count)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[47] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[70] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[93] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[116] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[139] FALSE FALSE
No special values or NaN values in the immigration count variable.
# Check Immi_count variable for special values
is.finite(immi_gini$gini)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[24] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[47] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
[70] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[93] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
[116] FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[139] TRUE FALSE
is.infinite(immi_gini$gini)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[47] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[70] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[93] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[116] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[139] FALSE FALSE
is.nan(immi_gini$gini)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[47] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[70] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[93] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[116] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[139] FALSE FALSE
No infinite values or NaN values, but there are NAs in the gini variable.
na_val <- sapply(immi_gini, function(x) sum(is.na(x))) #Check for NA values in the whole data frame
na_val
Year country_name immi_count country_code Region IncomeGroup gini
0 0 0 0 0 0 74
#Impute missing gini scores as average for the country
immi_gini$gini<- round(ave(immi_gini$gini,immi_gini$country_code,FUN=function(x) ifelse(is.na(x), mean(x,na.rm=TRUE), x)),2)
na_val <- sapply(immi_gini, function(x) sum(is.na(x)))
na_val
Year country_name immi_count country_code Region IncomeGroup gini
0 0 0 0 0 0 10
#I still have 10 missing values, which are for Hong Kong. As Hong Kong is a part of China, I impute the GINI values as a mean of China gini scores
immi_gini[which(immi_gini$country_code == "HKG"),]$gini <- Hmisc::impute(immi_gini[which(immi_gini$country_code == "CHN"),]$gini, fun = mean)
na_val <- sapply(immi_gini, function(x) sum(is.na(x)))
na_val
Year country_name immi_count country_code Region IncomeGroup gini
0 0 0 0 0 0 0
The dataset is clean now.
I now scan the numeric variables for outliers.
#outlier Detection using z-scores for immigration count
z_immig.scores <- immi_gini$immi_count %>% scores(type = "z")
z_immig.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.553152 -0.456701 -0.366447 0.000000 -0.004809 3.996336
Immigration numbers vary per year and there can be occurences where a particular country has a disproportionate value for immigration. Hence I chose not to discard this value.
immi_gini[ which( abs(z_immig.scores) >3 ) ,]$country_name
[1] "China" "China" "China" "China" "China" "China" "China" "China" "China"
China is represented disproportionately when it comes to immigration numbers in the family stream, when compared to other countries in top 15.
immi_gini$immi_count %>% boxplot(main="Box Plot of Immigration Count", ylab="Immigration numbers", col = "Blue")
It can be seen that certain most of the countries within top 15 are sending less than 1000 persons to Australia per year, however, some countries have high immigration numbers. This is expected, hence I do not remove any values.
#outlier Detection using z-scores for gini
z.scores <- immi_gini$gini %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.4787 -0.4906 -0.1497 0.0000 0.1499 3.1488
which( abs(z.scores) >3 )
[1] 71 72 73 74 75 76 77 78 79 80
immi_gini[ which( abs(z.scores) >3 ) ,]$country_name
[1] "South Africa" "South Africa" "South Africa" "South Africa" "South Africa" "South Africa" "South Africa" "South Africa" "South Africa"
[10] "South Africa"
Gini scores show an even spread, however, South Africa has a higher level of inequality amongst the top 15 countries in teh family migration stream.
Using mutate, create a new variable that is the average for the country.
z_gini <- immi_gini
z_gini$country_average_gini <- z_gini %>%
group_by(country_code) %>%
mutate(
gini_mean = mean(gini)
) %>% ungroup %>% dplyr::select(gini_mean) %>% round(2)
This column can be used to compare how the immigration numbers are tracking against the mean gini coefficient, in each country, so we can analyse how immigration numbers are being affected by a change in the inequality score.
I scale and center the gini coefficient so that the trends in inequality can be easily identified. I then create a histogram and box plot to visualise the distribution.
# This is the R chunk for the Transform Section
z_gini$scaled_gini <- scale(z_gini$country_average_gini, center = TRUE, scale = TRUE)
hist(z_gini$scaled_gini,main = "Inequality trend for family migration for top 15 countries")
As can be seen, the inequality in the top 15 countries in the family immigration stream is generally declining. However, there is as slight increase in some countries, and South Africa has a high level of inequality and bucks the trend when compared to other countries.
##References
How to import R data from excel sheet with multiple worksheets. Last accessed 6th June 2020.
https://stackoverflow.com/questions/12945687/read-all-worksheets-in-an-excel-workbook-into-an-r-list-with-data-frames
https://en.wikipedia.org/wiki/Gini_coefficient Last accessed 7th June 2020.
In economics, the Gini coefficient (/ˈdʒiːni/ JEE-nee), sometimes called the Gini index or Gini ratio, is a measure of statistical dispersion intended to represent the income or wealth distribution of a nation’s residents, and is the most commonly used measurement of inequality. A Gini index value above 50 is considered high; countries including Brazil, Colombia, South Africa, Botswana, and Honduras can be found in this category. A Gini index value of 30 or above is considered medium; countries including Vietnam, Mexico, the United States, Argentina, Russia, and Uruguay can be found in this category. A Gini index value lower than 30 is considered low; countries including Austria, Germany, Denmark, Norway, Poland, Slovenia, Sweden, and Ukraine can be found in this category