Assignment 2 - Statistical analysis of Population in Victoria

RPubs link information

This report is also available on RPubs: https://rpubs.com/Luke97/956616

Introduction

Population growth is a critical consideration for governments and other organizations when planning. As Australia is one of the most developed countries in the world, city and town planning is a field that receives considerable attention. Needs and living patterns vary across demographic segments, which can lead to very different planning decisions. It is therefore important, for governments and individuals alike, to understand the population trend in a region of Australia.

Problem Statement

According to Sustainable Population Australia (2022), the population of Australia has been increasing for the last 20 years. As Victoria is one of the most populous states in Australia, its population would be expected to have grown as well.

A population can be divided into many different segments. Is population growth in one segment related to growth in another? If so, how have the populations of the different segments in Victoria related to each other in recent years? To investigate, gender is selected as the demographic segment. This report examines the population of Victoria by gender from 2016 to 2021. The dataset comprises the population of Australia by gender and state; however, this report concentrates on Victoria as a proof of concept.

This report focuses on the pattern of the population by gender in Victoria. Summary statistics, visualisations, and hypothesis testing are used to answer these questions: the statistics and visualisations build an understanding of the data, and the hypothesis test examines whether there is a statistically significant difference between the Male and Female populations.

Data

The data set is published by the Australian Bureau of Statistics. It can be found in the ‘National, state and territory population’ release at the link provided below. The data used in this assignment is located in the sheet “Table_5” of the spreadsheet.

data source: https://www.abs.gov.au/statistics/people/population/national-state-and-territory-population/mar-2022#data-downloads-data-cubes

The pre-processing was done in R. The code for every pre-processing step is included in the next section, with comments explaining each action. In brief, the steps are:

Read data from the spreadsheet
Select required data only from specific fields (Male and Female are stored separately at the beginning).
Rename the column name to a more suitable name.
Omit NA values, if any.
Convert data type for columns (Population for all States: character to numeric).
Select only required data by using select function (Victoria and year).
Merge Male and Female with join function and rename the columns as below.

*Note that the structure of the dataset is modified further in later sections in order to compute statistics and produce visualisations.

Columns of data (after pre-processing):
- Male : This column contains the Population of Male in Victoria.
- Female : This column contains the Population of Female in Victoria.
- Year : This column contains the Year of the populations.

Descriptive Statistics and Visualisation

Load Packages

library(readxl)
library(ggplot2) 
library(dplyr)  
library(tidyr)
library(car)

Preprocessing

# Pre-processing
# Read data from spreadsheet
# In order to execute the code with the data spreadsheet, you have to download the spreadsheet from the source and rename it as data.xlsx
data <- read_excel("data.xlsx", sheet = "Table_5")

# Select the required rows only (the male and female blocks are stored separately)
male<-data[c(13:18),1:10]
female<-data[c(38:43),1:10]

# Update the column names
# (colname keeps the original header row for reference; it is not used below)
colname <- c(data[4,2:10])
new_colname <-c('NSW','VIC','QLD','SA','WA','TAS','NT','ACT','Total')
colnames(male) <- c("Year",new_colname)
colnames(female) <- c("Year",new_colname)

# omit na
female<-na.omit(female)
male<-na.omit(male)

# convert to suitable data type
index <- c(2:10)
female[index] <- sapply(female[index],as.numeric)
male[index] <- sapply(male[index],as.numeric)

# filtering- select required columns
male_VIC <- male %>% select(Year,VIC)
female_VIC <- female %>% select(Year,VIC)

# rename column
colnames(male_VIC) <- c('Year','Male')
colnames(female_VIC) <- c('Year','Female')

#merge dataset
new_data <- full_join(male_VIC,female_VIC,by='Year')

# Divide the population by 1000 to make the plots more readable,
# so that axis labels and legends do not show scientific notation such as 3e+06
new_data <- new_data %>% mutate(Male = Male/1000, Female=Female/1000)
new_data

## # A tibble: 6 x 3
##   Year   Male Female
##   <chr> <dbl>  <dbl>
## 1 2016  3081.  3153.
## 2 2017  3142.  3213.
## 3 2018  3203.  3271.
## 4 2019  3257.  3326.
## 5 2020  3245.  3318.
## 6 2021  3244.  3316.

Data Descriptive Statistics

The data has been summarised by gender. All the summary statistics for Male and Female are close, which indicates that the population is balanced in terms of gender proportion. There are no missing values in the data set.

new_data1<-new_data %>% gather('Male','Female', key = 'Gender', value = 'Population')
#change gender from char to factor (levels are ordered alphabetically: Female, Male)
new_data1$Gender<-as.factor(new_data1$Gender)
#By Gender
new_data1 %>% group_by(Gender) %>% summarise(
  Mean= mean(Population, na.rm = TRUE),
  Median = median(Population,na.rm = TRUE),
  SD = sd(Population,na.rm = TRUE),
  Q1=quantile(Population, c(0.25)),
  Q3=quantile(Population, c(0.75)),
  IQR = IQR(Population,na.rm = TRUE),
  Min = min(Population,na.rm = TRUE),
  Max = max(Population,na.rm = TRUE),
  Missing = sum(is.na(Population)))

## # A tibble: 2 x 10
##   Gender  Mean Median    SD    Q1    Q3   IQR   Min   Max Missing
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <int>
## 1 Female 3266.  3294.  70.1 3227. 3318.  90.6 3153. 3326.       0
## 2 Male   3195.  3223.  70.1 3157. 3245.  87.8 3081. 3257.       0
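The reshaping step used for the summary above relies on `gather()`, which tidyr has superseded with `pivot_longer()`. As a minimal sketch, assuming the rounded values printed in the pre-processing output (the data frame name `vic` is introduced here for illustration only), the same long format could be produced as follows:

```r
library(tidyr)

# Values (in thousands) copied from the printed tibble; they are rounded
vic <- data.frame(
  Year   = c("2016", "2017", "2018", "2019", "2020", "2021"),
  Male   = c(3081, 3142, 3203, 3257, 3245, 3244),
  Female = c(3153, 3213, 3271, 3326, 3318, 3316)
)

# pivot_longer() stacks the Male and Female columns into Gender/Population pairs
vic_long <- pivot_longer(
  vic,
  cols      = c(Male, Female),
  names_to  = "Gender",
  values_to = "Population"
)

# Convert Gender to a factor; levels are sorted alphabetically by default
vic_long$Gender <- factor(vic_long$Gender)
levels(vic_long$Gender)
```

The result has one row per year and gender, which is the shape expected by `group_by(Gender)` and by the plotting code.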

Data Visualization

For the data visualisation, four types of plot (a line plot, a pie chart, a boxplot, and Q-Q plots) are used to show the distribution and patterns of the data.

#data for line plot
line_data <-new_data %>% gather('Male','Female', key = 'Gender', value = 'Population')
#data for bar plot
bar_data<-mutate(new_data,Total = Male+Female)
bar_data <-bar_data %>% gather('Male','Female','Total', key = 'Gender', value = 'Population')
#data for pie plot
pie_data <- line_data %>% filter(Year =='2021')
#box plot
box_data <-new_data %>% gather('Male','Female', key = 'Gender', value = 'Population')

#Plots
ggplot(data=line_data, aes(x=Year, y=Population, group=Gender)) +
  geom_line(aes(color=Gender))+
  geom_point(aes(color=Gender))

From the line graph, the population curves for Male and Female have the same shape, although the size of each yearly increase or decrease differs slightly. The female population is about 70,000 greater than the male population throughout the period (note that the population was divided by 1,000 in the pre-processing section). Looking at the overall pattern, the population grew steadily from 2016 to 2019 and has decreased slowly since then.
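The size of the gap between the two lines can be checked directly. The sketch below assumes the rounded values printed in the pre-processing output (in thousands); the names `vic` and `gap` are introduced here for illustration only.

```r
# Values (in thousands) copied from the printed tibble; they are rounded
vic <- data.frame(
  Year   = c("2016", "2017", "2018", "2019", "2020", "2021"),
  Male   = c(3081, 3142, 3203, 3257, 3245, 3244),
  Female = c(3153, 3213, 3271, 3326, 3318, 3316)
)

# Yearly gap between the female and male populations, in thousands
gap <- vic$Female - vic$Male
names(gap) <- vic$Year
gap

# Average gap over the six years, in thousands
mean(gap)
```

With these rounded figures, the yearly gap stays between 68 and 73 thousand.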

ggplot(pie_data, aes(x='', y=Population, fill=Gender )) +
  geom_col() +
  geom_label(aes(label = Population), color = "white",
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  coord_polar(theta = "y")+
  theme_void()

The pie chart shows the proportion of the population by gender in 2021. Female has a slightly greater proportion than Male, but the difference is not obvious.
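Because the two slices are hard to tell apart by eye, the shares can also be computed numerically. This is a minimal sketch using the rounded 2021 values from the pre-processing output (in thousands); `pop_2021` and `share` are illustrative names.

```r
# 2021 populations in thousands, copied from the printed tibble (rounded)
pop_2021 <- c(Male = 3244, Female = 3316)

# Share of each gender in the 2021 total
share <- pop_2021 / sum(pop_2021)

# Shares as percentages, rounded to one decimal place
round(100 * share, 1)
```

On these figures the split is roughly 49.5% Male to 50.5% Female.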

box_data %>% boxplot(Population ~ Gender, data = ., ylab = "Population")

The boxplot shows two notable things. First, there are no outliers, so no further modification of the data is required. Second, the female population is generally greater than the male population.

test_male <-new_data1 %>% filter(Gender == 'Male') 
test_male$Population%>% qqPlot(dist="norm")

## [1] 1 4

test_female<-new_data1 %>% filter(Gender == 'Female') 
test_female$Population%>% qqPlot(dist="norm")

## [1] 1 4

Q-Q plots are used to check whether the Male and Female data are normally distributed. Because the dataset is very small, a histogram might not be suitable for checking the distribution, so Q-Q plots are used instead. All the data points for both Male and Female fall within the confidence bands, so the data for both groups are considered approximately normally distributed.
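A formal normality test could complement the visual check. The sketch below uses the Shapiro-Wilk test from base R, which is not part of the original analysis and is shown only as an optional cross-check. It assumes the rounded values from the pre-processing output, and with only six observations per group the test has little power, so it should be read alongside the Q-Q plots rather than instead of them.

```r
# Values (in thousands) copied from the printed tibble; they are rounded
vic <- data.frame(
  Male   = c(3081, 3142, 3203, 3257, 3245, 3244),
  Female = c(3153, 3213, 3271, 3326, 3318, 3316)
)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw_male   <- shapiro.test(vic$Male)
sw_female <- shapiro.test(vic$Female)

# p-values for both groups; values above 0.05 give no evidence against normality
c(Male = sw_male$p.value, Female = sw_female$p.value)
```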

Hypothesis Testing

For the hypothesis testing, a two-tailed, two-sample t-test is used to test whether there is a significant difference in population between Male and Female. First, the assumptions have to be checked. The assumption of normality was checked in the previous section using Q-Q plots, and the data are approximately normally distributed. The assumption of equal variances can be checked with Levene’s test. From the result of Levene’s test shown below, the p-value of 0.98 is greater than 0.05, so the assumption of equal variances is not rejected.

leveneTest(Population ~ Gender, data = new_data1)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1   6e-04 0.9805
##       10
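To make the test above less of a black box: with `center = median`, Levene’s test is equivalent to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R, assuming the rounded values from the pre-processing output, so the F value will be close to, but not exactly, the one reported by `leveneTest()`. The names `vic_long`, `AbsDev`, and `levene_by_hand` are illustrative.

```r
# Rounded populations (in thousands) in long format
vic_long <- data.frame(
  Gender     = rep(c("Male", "Female"), each = 6),
  Population = c(3081, 3142, 3203, 3257, 3245, 3244,
                 3153, 3213, 3271, 3326, 3318, 3316)
)

# Absolute deviation of each observation from its own group median
group_median    <- ave(vic_long$Population, vic_long$Gender, FUN = median)
vic_long$AbsDev <- abs(vic_long$Population - group_median)

# One-way ANOVA on the absolute deviations
levene_by_hand <- anova(lm(AbsDev ~ Gender, data = vic_long))
levene_by_hand
```

A large p-value here indicates that the spread around the median is similar in both groups, which is what the equal-variance t-test requires.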

With the assumptions checked, the statistical hypotheses are stated as follows, where μ1 and μ2 are the mean Female and Male populations:
H0: μ1 − μ2 = 0 (the mean Male population is equal to the mean Female population)
HA: μ1 − μ2 ≠ 0 (the mean Male population is not equal to the mean Female population)

\[H_0 : \mu_1 = \mu_2 \] \[H_A: \mu_1 \ne \mu_2 \] The significance level α is set to: \[ \alpha = 0.05 \]
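For reference, the equal-variance two-sample t statistic has the standard pooled form shown below. The symbols are notation introduced here, not taken from the original report: \bar{x}_i, s_i and n_i are the sample mean, sample standard deviation and sample size of group i, and s_p is the pooled standard deviation. Each group contains six yearly observations, which gives the degrees of freedom.

```latex
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
   \qquad
   s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2} \]
\[ df = n_1 + n_2 - 2 = 6 + 6 - 2 = 10 \]
```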

result <- t.test(
Population ~ Gender,
data = new_data1,
var.equal = TRUE,
alternative = "two.sided"
)
result

## 
##  Two Sample t-test
## 
## data:  Population by Gender
## t = 1.7585, df = 10, p-value = 0.1092
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -19.01608 161.43141
## sample estimates:
## mean in group Female   mean in group Male 
##             3266.331             3195.124

# Displaying the value of t statistics:
result$statistic

##        t 
## 1.758523

# The critical value (lower tail; H0 is rejected if |t| exceeds 2.228):
qt(p = 0.05/2 , df = 10 )

## [1] -2.228139
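The decision rule can be written out explicitly. The sketch below takes the t statistic and degrees of freedom reported above as given; the variable names are illustrative. It compares |t| with the upper critical value and recomputes the two-sided p-value from the t distribution.

```r
t_stat <- 1.758523   # t statistic reported by t.test()
df     <- 10         # degrees of freedom reported by t.test()
alpha  <- 0.05       # significance level

# Upper critical value of the two-tailed test
t_crit <- qt(1 - alpha / 2, df = df)

# Reject H0 only if the statistic falls outside [-t_crit, t_crit]
reject_h0 <- abs(t_stat) > t_crit

# Two-sided p-value: probability in both tails beyond |t|
p_value <- 2 * pt(-abs(t_stat), df = df)

reject_h0
p_value
```

Both routes lead to the same decision: |t| is below the critical value and the p-value is above α, so H0 is not rejected.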

# Displaying p-value:
result$p.value

## [1] 0.1091644

# Displaying confidence interval:
result$conf.int

## [1] -19.01608 161.43141
## attr(,"conf.level")
## [1] 0.95

The two-sample t-test gives a p-value of 0.1092, which is greater than α = 0.05, and the 95% CI of the difference between means, [-19.016, 161.431], contains the null value of 0. Therefore, the decision is to fail to reject H0: μ1 = μ2.

Discussion

In this report, the population patterns of Male and Female appear very similar, as can be seen in the statistics and visualisations section: the plots show that the trend and size of the two populations are alike. Consistent with this, the hypothesis test failed to reject H0 (p = 0.1092), so there is no statistically significant difference between the mean Male and Female populations of Victoria from 2016 to 2021. Note that a two-sample t-test compares means only; it does not by itself show whether one population affects the other.

From this investigation, populations within the same segmentation, such as gender or age group, may follow similar patterns. Relationships might also exist across different segmentations; for example, the population by gender might be related to the population by income group. To conclude, the Male and Female populations of Victoria show a similar pattern and size, and this analysis finds no statistically significant difference between them.

References

[1] Baglin, J., 2022. Module 7 - Testing the Null: Data on Trial. [online] Applied Analytics Course Website. Available at: https://astral-theory-157510.appspot.com/secured/MATH1324_Module_07.html [Accessed 12 October 2022].

[2] Sustainable Population Australia, 2022. Australia’s Population. [online] Sustainable Population Australia. Available at: https://population.org.au/about-population/australias-population/ [Accessed 1 October 2022].

[3] Australian Bureau of Statistics, 2022. National, state and territory population. [online] Australian Bureau of Statistics. Available at: https://www.abs.gov.au/statistics/people/population/national-state-and-territory-population/mar-2022/ [Accessed 4 October 2022].