Required packages

library(readxl)
library(tidyr)
library(dplyr)
library(editrules)
library(outliers)
library(MVN)

Executive Summary

The datasets of regional population by gender and age in 2018-2019 were obtained from Australian Bureau of Statistics to gain more insights about gender ratio in regional areas in Victoria and how their population changed from 2018 to 2019.

The first two datasets were imported as xlsx. files for the population estimates by sex across Local Government Areas (LGA) in 2018 and filtered for Victorian region. The final dataset described the total population estimates across LGA in Victoria in 2018-2019. All of the imported datasets were then joined using LGA as key variables and checked for data structure. Data conversion was taken from character to numeric values. The merged dataset was untidy because column headers were not variables so that tidying data was carried out for gender columns with the estimated number of residents as values. Two new variables were created with mutate function to determine the population ratio by sex and the population growth between 2018-2019 for LGAs.

The tidy dataset was then scanned for missing values and special values for all numeric values. The obvious errors of numeric data were also scanned for the total population of 2018 and 2019 to ensure that there were no negative values in the variables. The outliers were detected for following variables: Estimated residents, Population ratio and Growth rate using boxplot. The location of these outliers was identified using z.scores method for univariate data. The data from total population of 2018 and 2019 was scanned for outliers as multivariate data using Mahalanobis distance approach and visualised by Chi-square Q-Q plot. Following this, since the outliers were decided not to eliminate from the dataset, data transformation was taken for these two variables using \(ln\) transformation to reduce the right skewness in the distributions. The final dataset was ready for further statistical analysis.

Data

First dataset

The Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018 .xls was collected from Australian Bureau of Statistics at https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3235.02018?OpenDocument&fbclid=IwAR0JOzaSa6clO2HYae-2ZbsdXZjNBuk5XK-Z5YEM0NhCuoCjW-fp8IkGbtQ, which measured the estimate of the residental population in Australia by age and gender corresponding to State and Local Government Areas at 30 June 2018.

The original dataset #1 was firstly imported with 7 skipped rows due to invalid headers and stored as pop_female for sheet named “Table 2”.

Since the study focused on total population by gender in Victoria, the S/T code under ASGS 2018 for Victoria (S/T code = 2) was filtered from the original dataset using filter().

The column names of vic_pop_female were then renamed using colnames().

Since the study would not consider the number of residents under Age group categories, only Total Female population per LGA was selected from vic_pop_female using select() and assigned into new object called data_female.

The data_female now contained four main variables:

  • S/T code [categorical]: State/Territory code
  • S/T name [character]: State/Territory name
  • LGA code [categorical]: code of Local Government Areas per State e.g 20110 for Alpine (S)
  • LGA Name [character]: name of Local Government Areas per State e.g Alpine
  • Total Female population [numeric]: the total of female residents from all age groups by LGA
pop_female <- read_excel("Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018 .xls", sheet = "Table 2", skip = 7)
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
vic_pop_female <- pop_female %>% filter(`ASGS 2018` == 2) 
colnames(vic_pop_female) <- c("S/T code", "S/T name", "LGA code", "LGA name", "Age 0-4", "Age 5-9", "Age 10-14", "Age 15-19", "Age 20-24", "Age 25-29", "Age 30-34", "Age 35-39", "Age 40-44", "Age 45-49", "Age 50-54", "Age 55-59", "Age 60-64", "Age 65-69", "Age 70-74", "Age 75-79", "Age 80-84", "Age 85 and over", "Total Female population")
data_female <- vic_pop_female %>% select(`S/T code` , `S/T name`, `LGA code`, `LGA name`, `Total Female population`)
head(data_female)

Second dataset

The pop_male was also imported from Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018 .xls with sheet named “Table 1”.

Since the study focused on total population by gender in Victoria, the S/T code under ASGS 2018 for Victoria (S/T code = 2) was also filtered from pop_male using filter().

The column names of vic_pop_male were also renamed using colnames().

Since the study would not consider the number of residents under Age group categories, only Total Male population per LGA was selected from vic_pop_male using select() and assigned into new object called data_male.

The data_female now contained four main variables:

  • S/T code [categorical]: State/Territory code
  • S/T name [character]: State/Territory name
  • LGA code [categorical]: code of Local Government Areas per State e.g 20110 for Alpine (S)
  • LGA Name [character]: name of Local Government Areas per State e.g Alpine
  • Total Male population [numeric]: the total of female residents from all age groups by LGA
pop_male <- read_excel("Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018 .xls", sheet = "Table 1", skip = 7)
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
vic_pop_male <- pop_male %>% filter(`ASGS 2018` == 2) 
colnames(vic_pop_male) <- c("S/T code", "S/T name", "LGA code", "LGA name", "Age 0-4", "Age 5-9", "Age 10-14", "Age 15-19", "Age 20-24", "Age 25-29", "Age 30-34", "Age 35-39", "Age 40-44", "Age 45-49", "Age 50-54", "Age 55-59", "Age 60-64", "Age 65-69", "Age 70-74", "Age 75-79", "Age 80-84", "Age 85 and over", "Total Male population")
data_male <- vic_pop_male %>% select(`S/T code` , `S/T name`, `LGA code`, `LGA name`, `Total Male population`)
head(data_male)

Third dataset

The Population Estimates by Local Government Area, 2018 to 2019 .xls was also obtained from Australian Bureau of Statistics at https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3218.02018-19?OpenDocument&fbclid=IwAR205_G_WwqXHZDy0SJxsXfJX_SbSNN4owAo_uYnsQ1ecqmhMsuTpaJ4xVw, which indicated the total population for 2018 and 2019 as well as proportion for different components of population change (natural increase, net overseas migration, net internal migration) by LGA across States.

The original dataset #2 was firstly imported with sheet named “Table 2” (Victoria) and 6 skipped rows due to invalid headers and stored as pop_vic.

The total population of Victoria by LGA in 2018 and 2019 were selected from the original dataset pop_vic using select() and assigned into new object new_pop_vicfor later merging datasets.

The column names of new_pop_vic was renamed and now includes three variables:

  • LGA code [categorical]: code of Local Government Areas in Victoria
  • LGA name[characer]: name of Local Government Areas in Victoria
  • 2018[numeric]: total population in 2018 by lGA
  • 2019[numeric]: total population in 2019 by lGA
pop_vic <- read_excel("Population Estimates by Local Government Area, 2018 to 2019 .xls", sheet = "Table 2", skip = 6)
New names:
* `` -> ...1
* `` -> ...2
* `` -> ...5
* `` -> ...7
* `` -> ...8
* … and 1 more problem
new_pop_vic <- pop_vic %>% select (c(1,2,3,4)) 
colnames(new_pop_vic) <- c("LGA code", "LGA name", "Total_2018", "Total_2019")
head(new_pop_vic)
NA

Merging datasets

The data_female and data_male were firstly merged using inner_join which used S/T code, S/T name,LGA code and LGA name as key variables and stored as data_gender.

  • Data_gender described the total population by gender across Local Government Areas in Victoria by 2018.
data_gender <- inner_join(data_female, data_male,by = c("S/T code", "S/T name", "LGA code", "LGA name"))
head(data_gender)

The final dataset pop_vic_gender was then created by merging data_gender and new_pop_vic using inner_join which used both LGA code and LGA name as key variables.

The final dataset now includes 8 variables:

  • S/T code [categorical]: State/Territory code
  • S/T name [character]: State/Territory name
  • LGA code [categorical]: code of Local Government Areas per State e.g 20110 for Alpine (S)
  • LGA Name [character]: name of Local Government Areas per State e.g Alpine
  • Total Female population [numeric]: the total of female residents by LGA
  • Total Male population [numeric]: the total of male residents by LGA
  • Total_2018[numeric]: total population in 2018 by lGA
  • Total_2019[numeric]: total population in 2019 by lGA
pop_vic_gender <- inner_join(data_gender, new_pop_vic, by = c("LGA code", "LGA name"))
head(pop_vic_gender)

Understand

The dimension of the dataset was checked using dim(), which resulted at 80 observations and 8 variables.

The structure of pop_vic_gender showed that all variables were in character type. Therefore, the conversion to numeric data type and factor were needed.

dim(pop_vic_gender)
[1] 80  8
str(pop_vic_gender)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   80 obs. of  8 variables:
 $ S/T code               : chr  "2" "2" "2" "2" ...
 $ S/T name               : chr  "Victoria" "Victoria" "Victoria" "Victoria" ...
 $ LGA code               : chr  "20110" "20260" "20570" "20660" ...
 $ LGA name               : chr  "Alpine (S)" "Ararat (RC)" "Ballarat (C)" "Banyule (C)" ...
 $ Total Female population: chr  "6391" "5468" "55293" "66536" ...
 $ Total Male population  : chr  "6339" "6327" "52032" "63701" ...
 $ Total_2018             : chr  "12730" "11795" "107324" "130250" ...
 $ Total_2019             : chr  "12814" "11845" "109505" "131631" ...
col.num <- c("S/T code", "Total Female population", "Total Male population", "Total_2018", "Total_2019")
pop_vic_gender[col.num] <- sapply(pop_vic_gender[col.num],as.numeric)
pop_vic_gender$`LGA code` <- as.factor(pop_vic_gender$`LGA code`)
str(pop_vic_gender)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   80 obs. of  8 variables:
 $ S/T code               : num  2 2 2 2 2 2 2 2 2 2 ...
 $ S/T name               : chr  "Victoria" "Victoria" "Victoria" "Victoria" ...
 $ LGA code               : Factor w/ 80 levels "20110","20260",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ LGA name               : chr  "Alpine (S)" "Ararat (RC)" "Ballarat (C)" "Banyule (C)" ...
 $ Total Female population: num  6391 5468 55293 66536 18021 ...
 $ Total Male population  : num  6339 6327 52032 63701 17306 ...
 $ Total_2018             : num  12730 11795 107324 130250 35326 ...
 $ Total_2019             : num  12814 11845 109505 131631 36320 ...

Tidy & Manipulate Data I

The pop_vic_gender dataset was untidy because Total Female population and Total Male population as column headers were not variables.

gather() was used to gather female and male columns into one single column named Gender and stored the values under Estimated residents column.

A tidy dataset was then stored as tidy_pop_gender.

tidy_pop_gender <- pop_vic_gender %>% gather(`Total Female population`, `Total Male population`,  key = "Gender", value = "Estimated residents")
head(tidy_pop_gender)
is.factor(tidy_pop_gender$Gender)
[1] FALSE
tidy_pop_gender$Gender <- as.factor(tidy_pop_gender$Gender)
tidy_pop_gender$Gender<- factor(tidy_pop_gender$Gender, level = c("Total Female population", "Total Male population"), labels = c("female", "male"))
tidy_pop_gender <- tidy_pop_gender[ , c(1,2,3,4,7,8,5,6)]
head(tidy_pop_gender)

Tidy & Manipulate Data II

The Population ratio was created to estimate the proportion of female and male residents by total population per Local Government Areas in Victoria by 2018.

The Growth rate was also mutated to determine the growth rate of population between 2018 and 2019 for every Local Government Areas in Victoria.

Growth rate = \(\frac{Present value - Past value}{Past value}*100\)
new_pop <- tidy_pop_gender %>% mutate("Population ratio" = round((`Estimated residents`/`Total_2018`)*100, 2), "Growth rate" = ((`Total_2019` - `Total_2018`)/`Total_2018`)*100)
head(new_pop)
new_pop %>% group_by(`Gender`) %>% summarise(avergage = mean(`Population ratio`, na.rm = TRUE))

Scan I

Scan for missing values

The missing values (NA) were scanned for all numeric values and output as the total number of missing values for each column using sum(is.na()).

The sapply() was used to apply sum(is.na()) in multiple columns for scanning.

There is no NA values detected in the dataset new_pop.

sapply(new_pop, function(x) sum(is.na(x)))
           S/T code            S/T name            LGA code            LGA name 
                  0                   0                   0                   0 
             Gender Estimated residents          Total_2018          Total_2019 
                  0                   0                   0                   0 
   Population ratio         Growth rate 
                  0                   0 

Scan for special values

The is.special function is created to check for special values that are either infinite values or Not a Number values (nan) in the dataset.

The special values were scanned for all numeric data in new_pop dataset and the output as the total number of special values for each columns using sum(is.special()).

The sapply() was also used to apply sum(is.special()) in multiple columns for scanning special values.

There is no special value detected in the dataset.

is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(new_pop, function(x) sum(is.special(x)))
           S/T code            S/T name            LGA code            LGA name 
                  0                   0                   0                   0 
             Gender Estimated residents          Total_2018          Total_2019 
                  0                   0                   0                   0 
   Population ratio         Growth rate 
                  0                   0 

Scan for obvivous errors

The data inconsistencies were checked for the dataset by generating a rule for numeric data. The rule is no negative numeric values in the total population of 2018 and 2019.

The rule was created using editset() and assigned it into rule.

The rule was then applied to the dataset using violatedEdits() to output logical values where TRUE is where inconsistencies are existed.

  • The output turned out to have no data inconsistencies in the dataset.
rule <- editset(c("Total_2018 > 0", "Total_2019 > 0"))
rule

Edit set:
num1 : 0 < Total_2018
num2 : 0 < Total_2019 
violatedEdits(rule, new_pop) 
      edit
record  num1  num2
   1   FALSE FALSE
   2   FALSE FALSE
   3   FALSE FALSE
   4   FALSE FALSE
   5   FALSE FALSE
   6   FALSE FALSE
   7   FALSE FALSE
   8   FALSE FALSE
   9   FALSE FALSE
   10  FALSE FALSE
   11  FALSE FALSE
   12  FALSE FALSE
   13  FALSE FALSE
   14  FALSE FALSE
   15  FALSE FALSE
   16  FALSE FALSE
   17  FALSE FALSE
   18  FALSE FALSE
   19  FALSE FALSE
   20  FALSE FALSE
   21  FALSE FALSE
   22  FALSE FALSE
   23  FALSE FALSE
   24  FALSE FALSE
   25  FALSE FALSE
   26  FALSE FALSE
   27  FALSE FALSE
   28  FALSE FALSE
   29  FALSE FALSE
   30  FALSE FALSE
   31  FALSE FALSE
   32  FALSE FALSE
   33  FALSE FALSE
   34  FALSE FALSE
   35  FALSE FALSE
   36  FALSE FALSE
   37  FALSE FALSE
   38  FALSE FALSE
   39  FALSE FALSE
   40  FALSE FALSE
   41  FALSE FALSE
   42  FALSE FALSE
   43  FALSE FALSE
   44  FALSE FALSE
   45  FALSE FALSE
   46  FALSE FALSE
   47  FALSE FALSE
   48  FALSE FALSE
   49  FALSE FALSE
   50  FALSE FALSE
   51  FALSE FALSE
   52  FALSE FALSE
   53  FALSE FALSE
   54  FALSE FALSE
   55  FALSE FALSE
   56  FALSE FALSE
   57  FALSE FALSE
   58  FALSE FALSE
   59  FALSE FALSE
   60  FALSE FALSE
   61  FALSE FALSE
   62  FALSE FALSE
   63  FALSE FALSE
   64  FALSE FALSE
   65  FALSE FALSE
   66  FALSE FALSE
   67  FALSE FALSE
   68  FALSE FALSE
   69  FALSE FALSE
   70  FALSE FALSE
   71  FALSE FALSE
   72  FALSE FALSE
   73  FALSE FALSE
   74  FALSE FALSE
   75  FALSE FALSE
   76  FALSE FALSE
   77  FALSE FALSE
   78  FALSE FALSE
   79  FALSE FALSE
   80  FALSE FALSE
   81  FALSE FALSE
   82  FALSE FALSE
   83  FALSE FALSE
   84  FALSE FALSE
   85  FALSE FALSE
   86  FALSE FALSE
   87  FALSE FALSE
   88  FALSE FALSE
   89  FALSE FALSE
   90  FALSE FALSE
   91  FALSE FALSE
   92  FALSE FALSE
   93  FALSE FALSE
   94  FALSE FALSE
   95  FALSE FALSE
   96  FALSE FALSE
   97  FALSE FALSE
   98  FALSE FALSE
   99  FALSE FALSE
   100 FALSE FALSE
   101 FALSE FALSE
   102 FALSE FALSE
   103 FALSE FALSE
   104 FALSE FALSE
   105 FALSE FALSE
   106 FALSE FALSE
   107 FALSE FALSE
   108 FALSE FALSE
   109 FALSE FALSE
   110 FALSE FALSE
   111 FALSE FALSE
   112 FALSE FALSE
   113 FALSE FALSE
   114 FALSE FALSE
   115 FALSE FALSE
   116 FALSE FALSE
   117 FALSE FALSE
   118 FALSE FALSE
   119 FALSE FALSE
   120 FALSE FALSE
   121 FALSE FALSE
   122 FALSE FALSE
   123 FALSE FALSE
   124 FALSE FALSE
   125 FALSE FALSE
   126 FALSE FALSE
   127 FALSE FALSE
   128 FALSE FALSE
   129 FALSE FALSE
   130 FALSE FALSE
   131 FALSE FALSE
   132 FALSE FALSE
   133 FALSE FALSE
   134 FALSE FALSE
   135 FALSE FALSE
   136 FALSE FALSE
   137 FALSE FALSE
   138 FALSE FALSE
   139 FALSE FALSE
   140 FALSE FALSE
   141 FALSE FALSE
   142 FALSE FALSE
   143 FALSE FALSE
   144 FALSE FALSE
   145 FALSE FALSE
   146 FALSE FALSE
   147 FALSE FALSE
   148 FALSE FALSE
   149 FALSE FALSE
   150 FALSE FALSE
   151 FALSE FALSE
   152 FALSE FALSE
   153 FALSE FALSE
   154 FALSE FALSE
   155 FALSE FALSE
   156 FALSE FALSE
   157 FALSE FALSE
   158 FALSE FALSE
   159 FALSE FALSE
   160 FALSE FALSE

Scan II

Scan for number and proportions of residents by gender

Boxplot was used to scan numeric values from Estimated residents and Population ratio columns for any potential outliers in the dataset.

There were two outliers recorded in the boxplot of Estimated residents which responded for more than 150,000 residents across LGAs in Victoria.

The outliers were then detected by Z-score method using scores(type = "z").

The summary of z scores showed that the minimum z-score is -1.0408 and the maximum is 3.395.

Since the absolute values of z score greater than 3 are considered as outliers, which(abs(z.scores)>3) was used to identify the location of outliers.

  • The observation in row 14 and 94 were considered as outliers.
  • It highlighted that the LGA named Casey in Victoria had the most outstanding population which is more 150,000 residents by gender living in the area.
  • Since this output might be interesting for future questions, these outliers would not be eliminated.
boxplot(new_pop$`Estimated residents`, main="Number of residents in Victoria (2018)", ylab = "Residents")

z.scores <- new_pop$`Estimated residents` %>%  scores(type = "z")
z.scores %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.0408 -0.8383 -0.4571  0.0000  0.7009  3.3958 
which(abs(z.scores) >3 )
[1] 14 94
new_pop[c(14,94), ]
length (which( abs(z.scores) >3 ))
[1] 2

There were four outliers recorded in the boxplot of Population ratio which was ranging from either more than 52% or less than 48% of residents living across LGAs.

The outliers were then detected by Z-score method using scores(type = "z").

The summary of z scores showed that the minimum z-score2 is -4.6315 and the maximum is 4.638060.

Since the absolute values of z score greater than 3 are considered as outliers, which(abs(z.scores)>3) was used to identify the location of outliers.

  • The observation in row 80 and 160 were considered as outliers.
  • It highlighted that the LGA named Unicorporated in Victoria which had the proportions of female residents lower than 48% and the proportions of male residents higher than 52%.
  • Unicorpotated Vic are areas without local gorvernment (Travel Victoria) at https://www.travelvictoria.com.au/victoria/localgovernment/.
  • Since these outliers are also meaningful, they would also not be eliminated from the dataset.
boxplot(new_pop$`Population ratio`, main="Proportions of residents in Victoria (2018)", ylab = "percentage")

z.scores2 <- new_pop$`Population ratio` %>%  scores(type = "z")
z.scores2 %>% summary()
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-4.631539 -0.594046  0.003261  0.000000  0.593007  4.638060 
which(abs(z.scores2) >3)
[1]  80 160
new_pop[c(80,160), ]

Scan for total population in 2018 and 2019

The total population in 2018 and 2019 across LGAs were subsetted from the dataset for scanning for multivariate data.

The Mahalanobis distance was used to detect multivariate outliers for the subset pop_sub dataset using mvn().

According to Chi-square Q-Q plot, there are 72 outliers dectected in the subset. These outliers would then be handled by transforming values.

pop_sub <- new_pop %>% select(`Total_2018`, `Total_2019`)
results <- mvn(data = pop_sub, multivariateOutlierMethod = "quan", showOutliers = TRUE)

The location of outliers were shown below:

results$multivariateOutliers

Boxplot was used to scan numeric values for Growth rate for any potential outliers in the dataset.

There were four outliers recorded in the boxplot of Estimated residents which indicated the growth rate of more than 4.0.

The outliers were then detected by Z-score method using scores(type = "z").

The summary of z scores showed that the minimum z-score is -1.7945 and the maximum is 3.1031.

Since the absolute values of z score greater than 3 are considered as outliers, which(abs(z.scores)>3) was used to identify the location of outliers.

  • The observation in row 76 and 156 were considered as outliers.
  • It highlighted that the LGA named Wyndham in Victoria had the highest growth rate of population from 2018 to 2019.
  • Since this output could bring interesting story for future analysis, these outliers would not be eliminated.
boxplot(new_pop$`Growth rate`, main = "Growth rate of LGAs in Victoria between 2018 and 2019")

z.scores3 <- new_pop$`Growth rate` %>%  scores(type = "z")
z.scores3 %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.7945 -0.6311 -0.1422  0.0000  0.4218  3.1031 
which(abs(z.scores3) >3 )
[1]  76 156
new_pop[c(76,156), ]

Transform

The data distribution was checked for both Total_2018 and Total_2019 using hist().

Both of the distributions showed high right skewness in the variables. Therefore, \(ln\) transformation was taken into data transformation to reduce the skewness using log().

hist(new_pop$Total_2018, main = "Total population in 2018")

hist(new_pop$Total_2019, main = "Total population in 2019")

The distribution for Total_2018 and Total_2019 after \(ln\) transformation:

log_2018 <- log(new_pop$Total_2018)
hist(log_2018, main = "Log population in 2018")

log_2019 <- log(new_pop$Total_2019)
hist(log_2019, main = "Log population in 2019")

References

Australian Bureau of Statistics 2019, Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018, Australia’s statistic datasets, data file, Australian Government, Australian Bureau of Statistics, Melbourne, viewed 20 May 2020, https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3235.02018?OpenDocument.

Australian Bureau of Statistics 2019, Population Estimates by Local Government Area, 2018 to 2019 , Australia’s statistic datasets, data file, Australian Government, Australian Bureau of Statistics, Melbourne, viewed 20 May 2020, https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3218.02018-19?OpenDocument&fbclid=IwAR205_G_WwqXHZDy0SJxsXfJX_SbSNN4owAo_uYnsQ1ecqmhMsuTpaJ4xVw.

Parker, B 2002, Planning Analysis: Calculating Growth Rates, University of Oregon, viewed 30 May 2020, https://pages.uoregon.edu/rgp/PPPM613/class8a.htm#:~:text=The%20annual%20percentage%20growth%20rate,N%2C%20the%20number%20of%20years.&text=In%201980%2C%20the%20population%20in%20Lane%20County%20was%20250%2C000.

Travel Victoria 2020, Local Government Areas, Travel Victoria, viewed 30 May 2020, https://www.travelvictoria.com.au/victoria/localgovernment/.



---
title: "MATH2349 Semester 1, 2020"
author: "Thanh Ngoc Mai Tran and S3835399"
subtitle: Assignment 2
output:
  html_notebook: default
  html_document:
    df_print: paged
---

## Required packages 

```{r}
library(readxl)
library(tidyr)
library(dplyr)
library(editrules)
library(outliers)
library(MVN)
```


## Executive Summary 

The datasets of regional population by gender and age in 2018-2019 were obtained from Australian Bureau of Statistics to gain more insights about gender ratio in regional areas in Victoria and how their population changed from 2018 to 2019.

The first two datasets were imported as xlsx. files for the population estimates by sex across Local Government Areas (LGA) in 2018 and filtered for Victorian region. The final dataset described the total population estimates across LGA in Victoria in 2018-2019. All of the imported datasets were then joined using LGA as key variables and checked for data structure. Data conversion was taken from character to numeric values. The merged dataset was untidy because column headers were not variables so that tidying data was carried out for gender columns with the estimated number of residents as values.  Two new variables were created with mutate function to determine the population ratio by sex and the population growth between 2018-2019 for LGAs. 

The tidy dataset was then scanned for missing values and special values for all numeric values. The obvious errors of numeric data were also scanned for the total population of 2018 and 2019 to ensure that there were no negative values in the variables. The outliers were detected for following variables: Estimated residents, Population ratio and Growth rate using boxplot. The location of these outliers was identified using z.scores method for univariate data. The data from total population of 2018 and 2019 was scanned for outliers as multivariate data using Mahalanobis distance approach and visualised by Chi-square Q-Q plot. Following this, since the outliers were decided not to eliminate from the dataset, data transformation was taken for these two variables using $ln$ transformation to reduce the right skewness in the distributions. The final dataset was ready for further statistical analysis. 


## Data 

### First dataset

The Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018 .xls was collected from Australian Bureau of Statistics at https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3235.02018?OpenDocument&fbclid=IwAR0JOzaSa6clO2HYae-2ZbsdXZjNBuk5XK-Z5YEM0NhCuoCjW-fp8IkGbtQ, which measured the estimate of the residental population in Australia by age and gender corresponding to State and Local Government Areas at 30 June 2018.

The original dataset #1 was firstly imported with 7 skipped rows due to invalid headers and stored as `pop_female` for sheet named "Table 2".  

Since the study focused on total population by gender in Victoria, the S/T code under `ASGS 2018` for Victoria (S/T code = 2) was filtered from the original dataset using `filter()`. 

The column names of `vic_pop_female` were then renamed using `colnames()`.

Since the study would not consider the number of residents under Age group categories, only `Total Female population` per LGA  was selected from `vic_pop_female` using `select()` and assigned into new object called `data_female`. 

The `data_female` now contained four main variables:

* `S/T code` [categorical]: State/Territory code 
* `S/T name` [character]: State/Territory name
* `LGA code` [categorical]: code of Local Government Areas per State e.g 20110 for Alpine (S)
* `LGA Name` [character]: name of Local Government Areas per State e.g Alpine
* `Total Female population` [numeric]: the total of female residents from all age groups by LGA 

```{r}
pop_female <- read_excel("Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018 .xls", sheet = "Table 2", skip = 7)
vic_pop_female <- pop_female %>% filter(`ASGS 2018` == 2) 
colnames(vic_pop_female) <- c("S/T code", "S/T name", "LGA code", "LGA name", "Age 0-4", "Age 5-9", "Age 10-14", "Age 15-19", "Age 20-24", "Age 25-29", "Age 30-34", "Age 35-39", "Age 40-44", "Age 45-49", "Age 50-54", "Age 55-59", "Age 60-64", "Age 65-69", "Age 70-74", "Age 75-79", "Age 80-84", "Age 85 and over", "Total Female population")
```


```{r}
data_female <- vic_pop_female %>% select(`S/T code` , `S/T name`, `LGA code`, `LGA name`, `Total Female population`)
head(data_female)
```


### Second dataset 

The `pop_male` was also imported from Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018 .xls with sheet named "Table 1". 

Since the study focused on total population by gender in Victoria, the S/T code under `ASGS 2018` for Victoria (S/T code = 2) was also filtered from `pop_male` using `filter()`. 

The column names of `vic_pop_male` were also renamed using `colnames()`.

Since the study would not consider the number of residents under Age group categories, only `Total Male population` per LGA  was selected from `vic_pop_male` using `select()` and assigned into new object called `data_male`. 

The `data_female` now contained four main variables:

* `S/T code` [categorical]: State/Territory code 
* `S/T name` [character]: State/Territory name
* `LGA code` [categorical]: code of Local Government Areas per State e.g 20110 for Alpine (S)
* `LGA Name` [character]: name of Local Government Areas per State e.g Alpine
* `Total Male population` [numeric]: the total of female residents from all age groups by LGA 

```{r}
pop_male <- read_excel("Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018 .xls", sheet = "Table 1", skip = 7)
vic_pop_male <- pop_male %>% filter(`ASGS 2018` == 2) 
colnames(vic_pop_male) <- c("S/T code", "S/T name", "LGA code", "LGA name", "Age 0-4", "Age 5-9", "Age 10-14", "Age 15-19", "Age 20-24", "Age 25-29", "Age 30-34", "Age 35-39", "Age 40-44", "Age 45-49", "Age 50-54", "Age 55-59", "Age 60-64", "Age 65-69", "Age 70-74", "Age 75-79", "Age 80-84", "Age 85 and over", "Total Male population")
```


```{r}
data_male <- vic_pop_male %>% select(`S/T code` , `S/T name`, `LGA code`, `LGA name`, `Total Male population`)
head(data_male)
```

### Third dataset

The Population Estimates by Local Government Area, 2018 to 2019 .xls was also obtained from Australian Bureau of Statistics at https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3218.02018-19?OpenDocument&fbclid=IwAR205_G_WwqXHZDy0SJxsXfJX_SbSNN4owAo_uYnsQ1ecqmhMsuTpaJ4xVw, which indicated the total population for 2018 and 2019 as well as proportion for different components of population change (natural increase, net overseas migration, net internal migration) by LGA across States.

The original dataset #2 was firstly imported with sheet named "Table 2" (Victoria) and 6 skipped rows due to invalid headers and stored as `pop_vic`.

The total population of Victoria by LGA in 2018 and 2019 were selected from the original dataset `pop_vic` using `select()` and assigned into new object `new_pop_vic`for later merging datasets.

The column names of `new_pop_vic` was renamed and now includes three variables:

* `LGA code` [categorical]: code of Local Government Areas in Victoria
* `LGA name`[characer]: name of Local Government Areas in Victoria
* `2018`[numeric]: total population in 2018 by lGA
* `2019`[numeric]: total population in 2019 by lGA


```{r}
pop_vic <- read_excel("Population Estimates by Local Government Area, 2018 to 2019 .xls", sheet = "Table 2", skip = 6)
new_pop_vic <- pop_vic %>% select (c(1,2,3,4)) 
colnames(new_pop_vic) <- c("LGA code", "LGA name", "Total_2018", "Total_2019")
head(new_pop_vic)

```


### Merging datasets

The `data_female` and `data_male` were firstly merged using `inner_join` which used `S/T code`, `S/T name`,`LGA code` and `LGA name` as key variables and stored as `data_gender`.

* `Data_gender` described the total population by gender across Local Government Areas in Victoria by 2018.

```{r}
data_gender <- inner_join(data_female, data_male,by = c("S/T code", "S/T name", "LGA code", "LGA name"))
head(data_gender)
```

The final dataset `pop_vic_gender` was then created by merging `data_gender` and `new_pop_vic` using `inner_join` which used both `LGA code` and `LGA name` as key variables. 

The final dataset now includes 8 variables:

* `S/T code` [categorical]: State/Territory code 
* `S/T name` [character]: State/Territory name
* `LGA code` [categorical]: code of Local Government Areas per State e.g 20110 for Alpine (S)
* `LGA Name` [character]: name of Local Government Areas per State e.g Alpine
* `Total Female population` [numeric]: the total of female residents by LGA
* `Total Male population` [numeric]: the total of male residents by LGA
* `Total_2018`[numeric]: total population in 2018 by lGA
* `Total_2019`[numeric]: total population in 2019 by lGA

```{r}
pop_vic_gender <- inner_join(data_gender, new_pop_vic, by = c("LGA code", "LGA name"))
head(pop_vic_gender)
```

## Understand 

The dimension of the dataset was checked using `dim()`, which resulted at 80 observations and 8 variables.

The structure of `pop_vic_gender` showed that all variables were in character type. Therefore, the conversion to numeric data type and factor were needed.

* The conversion of multiple columns to numeric data type was done by using `sapply` with `as.numeric`.
* The conversion of `LGA code` to factor variable was done by using `as.factor` without ordered.


```{r}
dim(pop_vic_gender)
str(pop_vic_gender)

col.num <- c("S/T code", "Total Female population", "Total Male population", "Total_2018", "Total_2019")
pop_vic_gender[col.num] <- sapply(pop_vic_gender[col.num],as.numeric)
pop_vic_gender$`LGA code` <- as.factor(pop_vic_gender$`LGA code`)
```


```{r}
str(pop_vic_gender)

```


##	Tidy & Manipulate Data I 

The `pop_vic_gender` dataset was untidy because `Total Female population` and `Total Male population` as column headers were not variables. 

`gather()` was used to gather female and male columns into one single column named `Gender` and stored the values under `Estimated residents` column.

* `Gender` was then check whether it was a factor variable using `is.factor()`.
    + Since the variable was not a factor, the conversion from character to factor was taken by using `as.factor`.
    + The levels of `Gender` was classified using `factor` with labels.

* `Estimated residents` showed the number of residents by gender for each Local Government Areas on Victoria by 2018. 

A tidy dataset was then stored as `tidy_pop_gender`.

* The columns of the tidy dataset were rearranged with `Gender` and `Estimated residents` before `Total_2018` and `Total_2019`.



```{r}
tidy_pop_gender <- pop_vic_gender %>% gather(`Total Female population`, `Total Male population`,  key = "Gender", value = "Estimated residents")
head(tidy_pop_gender)
```


```{r}
is.factor(tidy_pop_gender$Gender)
tidy_pop_gender$Gender <- as.factor(tidy_pop_gender$Gender)
tidy_pop_gender$Gender<- factor(tidy_pop_gender$Gender, level = c("Total Female population", "Total Male population"), labels = c("female", "male"))
```


```{r}
tidy_pop_gender <- tidy_pop_gender[ , c(1,2,3,4,7,8,5,6)]
```

```{r}
head(tidy_pop_gender)
```

##	Tidy & Manipulate Data II 

The `Population ratio` was created to estimate the proportion of female and male residents by total population per Local Government Areas in Victoria by 2018. 

* Population ratio calculated by dividing the `Estimated residents` with total population in `2018` and converting into proportions.

* The ratio indicated that there were higher mean proportions of female residents across LGA in Victoria by 2018 using `group_by()` and `summarise()`. 

The `Growth rate` was also mutated to determine the growth rate of population between 2018 and 2019 for every Local Government Areas in Victoria.

* Growth rate calculated by following with the equation sourced from [University of Oregon](https://pages.uoregon.edu/rgp/PPPM613/class8a.htm#:~:text=The%20annual%20percentage%20growth%20rate,N%2C%20the%20number%20of%20years.&text=In%201980%2C%20the%20population%20in%20Lane%20County%20was%20250%2C000).

<center> Growth rate = $\frac{Present value - Past value}{Past value}*100$ 
</center>


* This growth rate value would be useful for predicting future population in different LGAs using an exponential growth model.

```{r echo=TRUE}
new_pop <- tidy_pop_gender %>% mutate("Population ratio" = round((`Estimated residents`/`Total_2018`)*100, 2), "Growth rate" = ((`Total_2019` - `Total_2018`)/`Total_2018`)*100)
head(new_pop)
```


```{r echo=TRUE}
new_pop %>% group_by(`Gender`) %>% summarise(avergage = mean(`Population ratio`, na.rm = TRUE))
```


##	Scan I 

### Scan for missing values

The missing values (NA) were scanned for all numeric values and output as the total number of missing values for each column using `sum(is.na())`. 

The `sapply()` was used to apply `sum(is.na())` in multiple columns for scanning.

There is no NA values detected in the dataset `new_pop`. 

```{r}
sapply(new_pop, function(x) sum(is.na(x)))
```

### Scan for special values

The `is.special` function is created to check for special values that are either infinite values or Not a Number values (nan) in the dataset.

The special values were scanned for all numeric data in `new_pop` dataset and the output as the total number of special values for each columns using `sum(is.special())`.

The `sapply()` was also used to apply `sum(is.special())` in multiple columns for scanning special values.

There is no special value detected in the dataset. 


```{r}
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
```


```{r}
sapply(new_pop, function(x) sum(is.special(x)))
```

### Scan for obvivous errors

The data inconsistencies were checked for the dataset by generating a rule for numeric data. The rule is no negative numeric values in the total population of 2018 and 2019.

The rule was created using `editset()` and assigned it into `rule`.

The rule was then applied to the dataset using `violatedEdits()` to output logical values where TRUE is where inconsistencies are existed.

* The output turned out to have no data inconsistencies in the dataset. 
```{r}
rule <- editset(c("Total_2018 > 0", "Total_2019 > 0"))
rule
```
```{r}
violatedEdits(rule, new_pop) 
```


##	Scan II

###  Scan for number and proportions of residents by gender 

`Boxplot` was used to scan numeric values from `Estimated residents` and `Population ratio` columns for any potential outliers in the dataset.

There were two outliers recorded in the boxplot of `Estimated residents` which responded for more than 150,000 residents across LGAs in Victoria.

The outliers were then detected by Z-score method using `scores(type = "z")`.

The summary of z scores showed that the minimum z-score is -1.0408 and the maximum is 3.395.

Since the absolute values of z score greater than 3 are considered as outliers, `which(abs(z.scores)>3)` was used to identify the location of outliers.

* The observation in row 14 and 94 were considered as outliers.
* It highlighted that the LGA named Casey in Victoria had the most outstanding population which is more 150,000 residents by gender living in the area.
* Since this output might be interesting for future questions, these outliers would not be eliminated.
    
```{r, fig.width= 7}
boxplot(new_pop$`Estimated residents`, main="Number of residents in Victoria (2018)", ylab = "Residents")
```


```{r}
z.scores <- new_pop$`Estimated residents` %>%  scores(type = "z")
z.scores %>% summary()
```

```{r}
which(abs(z.scores) >3 )
new_pop[c(14,94), ]
```


```{r}
length (which( abs(z.scores) >3 ))
```


There were four outliers recorded in the boxplot of `Population ratio` which was ranging from either more than 52% or less than 48% of residents living across LGAs. 

The outliers were then detected by Z-score method using `scores(type = "z")`.

The summary of z scores showed that the minimum z-score2 is -4.6315 and the maximum is 4.638060.

Since the absolute values of z score greater than 3 are considered as outliers, `which(abs(z.scores)>3)` was used to identify the location of outliers.

* The observation in row 80 and 160 were considered as outliers.
* It highlighted that the LGA named Unicorporated in Victoria which had the proportions of female residents lower than 48% and the proportions of male residents higher than 52%.
* Unicorpotated Vic are areas without local gorvernment (Travel Victoria) at  https://www.travelvictoria.com.au/victoria/localgovernment/.
* Since these outliers are also meaningful, they would also not be eliminated from the dataset.

```{r}
boxplot(new_pop$`Population ratio`, main="Proportions of residents in Victoria (2018)", ylab = "percentage")
```


```{r}
z.scores2 <- new_pop$`Population ratio` %>%  scores(type = "z")
z.scores2 %>% summary()
```


```{r}
which(abs(z.scores2) >3)
```
```{r}
new_pop[c(80,160), ]
```

### Scan for total population in 2018 and 2019

The total population in 2018 and 2019 across LGAs were subsetted from the dataset for scanning for multivariate data. 

The Mahalanobis distance was used to detect multivariate outliers for the subset `pop_sub` dataset using `mvn()`.

According to Chi-square Q-Q plot, there are 72 outliers dectected in the subset. These outliers would then be handled by transforming values.

```{r}
pop_sub <- new_pop %>% select(`Total_2018`, `Total_2019`)
results <- mvn(data = pop_sub, multivariateOutlierMethod = "quan", showOutliers = TRUE)
```

The location of outliers were shown below:

```{r}
results$multivariateOutliers
```


`Boxplot` was used to scan numeric values for `Growth rate` for any potential outliers in the dataset.

There were four outliers recorded in the boxplot of `Estimated residents` which indicated the growth rate of more than 4.0.

The outliers were then detected by Z-score method using `scores(type = "z")`.

The summary of z scores showed that the minimum z-score is -1.7945 and the maximum is 3.1031.

Since the absolute values of z score greater than 3 are considered as outliers, `which(abs(z.scores)>3)` was used to identify the location of outliers.

* The observation in row 76 and 156 were considered as outliers.
* It highlighted that the LGA named Wyndham in Victoria had the highest growth rate of population from 2018 to 2019.
* Since this output could bring interesting story for future analysis, these outliers would not be eliminated.

```{r}
boxplot(new_pop$`Growth rate`, main = "Growth rate of LGAs in Victoria between 2018 and 2019")
```

```{r}
z.scores3 <- new_pop$`Growth rate` %>%  scores(type = "z")
z.scores3 %>% summary()
```

```{r}
which(abs(z.scores3) >3 )
new_pop[c(76,156), ]
```


##	Transform 

The data distribution was checked for both `Total_2018` and `Total_2019` using `hist()`.

Both of the distributions showed high right skewness in the variables. Therefore, $ln$ transformation was taken into data transformation to reduce the skewness using `log()`. 


```{r}
hist(new_pop$Total_2018, main = "Total population in 2018")
```


```{r}
hist(new_pop$Total_2019, main = "Total population in 2019")
```

The distribution for `Total_2018` and `Total_2019` after $ln$ transformation:

* The distribution for two variables were more symmetrical after data transformation.

```{r}
log_2018 <- log(new_pop$Total_2018)
hist(log_2018, main = "Log population in 2018")
```

```{r}
log_2019 <- log(new_pop$Total_2019)
hist(log_2019, main = "Log population in 2019")
```

## References

Australian Bureau of Statistics 2019, *Population Estimates by Age and Sex, Local Government Areas (ASGS 2018), 2018*,  Australia's statistic datasets, data file, Australian Government, Australian Bureau of Statistics, Melbourne, viewed 20 May 2020, <https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3235.02018?OpenDocument>. 

Australian Bureau of Statistics 2019, *Population Estimates by Local Government Area, 2018 to 2019 *,  Australia's statistic datasets, data file, Australian Government, Australian Bureau of Statistics, Melbourne, viewed 20 May 2020, <https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3218.02018-19?OpenDocument&fbclid=IwAR205_G_WwqXHZDy0SJxsXfJX_SbSNN4owAo_uYnsQ1ecqmhMsuTpaJ4xVw>. 

Parker, B 2002, *Planning Analysis: Calculating Growth Rates*, University of Oregon, viewed 30 May 2020, <https://pages.uoregon.edu/rgp/PPPM613/class8a.htm#:~:text=The%20annual%20percentage%20growth%20rate,N%2C%20the%20number%20of%20years.&text=In%201980%2C%20the%20population%20in%20Lane%20County%20was%20250%2C000>. 

Travel Victoria 2020, *Local Government Areas*, Travel Victoria, viewed 30 May 2020, <https://www.travelvictoria.com.au/victoria/localgovernment/>. 

<br>
<br>
