Problem Statement

According to Sustainable Population Australia (2022), the population of Australia has been increasing for the last 20 years. As Victoria is one of the most populous states in Australia, its population would be expected to have grown as well.

A population can be segmented in many different ways. Could population growth in one segment be related to growth in another? If so, what has the relation between the segments in Victoria been in recent years? To investigate further, gender is selected as one of the demographic segments. This report investigates the population of Victoria by gender from 2016 to 2021. The dataset comprises the population of Australia by gender and state; however, this report concentrates on Victoria as a proof of concept.

This report focuses on the pattern of the population by gender in Victoria. Summary statistics, visualisations, and hypothesis testing are used to answer these questions: the statistics and visualisations build an understanding of the data, and hypothesis testing investigates whether there is any relation between the Male and Female populations.

Data

The data set is published by the Australian Bureau of Statistics. It can be found in the ‘National, state and territory population’ release at the link provided below. The data used in this assignment is located in the spreadsheet “table_5”.
Data source: https://www.abs.gov.au/statistics/people/population/national-state-and-territory-population/mar-2022#data-downloads-data-cubes
The pre-processing was done in R. The code for all pre-processing steps is included in the next section, with detailed comments explaining each action.
A brief outline of the steps:
1. Read the data from the spreadsheet.
2. Select only the required data from specific fields (Male and Female are stored separately at the beginning).
3. Rename the columns to more suitable names.
4. Omit NA values, if any.
5. Convert the data type of the columns (population for all states: character to numeric).
6. Select only the required data using the select function (Victoria and year).
7. Merge Male and Female with the join function and rename the columns as below.
Note that the structure of the dataset is modified further in order to compute the statistics and produce the visualisations.

Data cont.

Columns of data (after pre-processing):
- Male : The male population of Victoria (in thousands, after dividing by 1000 in pre-processing).
- Female : The female population of Victoria (in thousands, after dividing by 1000 in pre-processing).
- Year : The year to which the population figures refer.

Descriptive Statistics and Visualisation

# Pre-processing
# To run this code, download the spreadsheet from the data source and save it as data.xlsx
library(readxl)      # read_excel()
library(dplyr)       # select(), mutate(), full_join(), %>%
library(tidyr)       # gather()
library(ggplot2)     # plots
library(car)         # qqPlot(), leveneTest()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
data <- read_excel("data.xlsx", sheet = "Table_5")  # Read data from the spreadsheet
# Select the required rows and columns only (Male and Female are stored in separate blocks)
male <- data[13:18, 1:10]
female <- data[38:43, 1:10]
# Rename the columns
new_colname <- c('NSW','VIC','QLD','SA','WA','TAS','NT','ACT','Total')
colnames(male) <- c("Year", new_colname)
colnames(female) <- c("Year", new_colname)
# Omit rows containing NA
female <- na.omit(female)
male <- na.omit(male)
# Convert the population columns from character to numeric
index <- 2:10
female[index] <- sapply(female[index], as.numeric)
male[index] <- sapply(male[index], as.numeric)
# Filtering: select the required columns
male_VIC <- male %>% select(Year, VIC)
female_VIC <- female %>% select(Year, VIC)
# Rename the population column by gender
colnames(male_VIC) <- c('Year', 'Male')
colnames(female_VIC) <- c('Year', 'Female')
# Merge the two datasets by year
new_data <- full_join(male_VIC, female_VIC, by = 'Year')
# Divide the population by 1000 so that plot labels and legends stay readable
# (otherwise they could be shown in scientific notation)
new_data <- new_data %>% mutate(Male = Male/1000, Female = Female/1000)
new_data -> table1
knitr::kable(table1) %>% kable_styling(bootstrap_options = "condensed", position = 'center')

Year	Male	Female
2016	3081.062	3152.918
2017	3141.518	3212.749
2018	3202.623	3271.049
2019	3257.027	3326.378
2020	3244.995	3318.470
2021	3243.517	3316.424

Descriptive Statistics cont.

Data Descriptive Statistics

The data has been summarised by gender. All the summary statistics for Male and Female are close (the means differ by about 71 thousand on a base of more than 3.1 million), which suggests that the population is roughly balanced in terms of gender. There are no missing values in the data set.

new_data1<-new_data %>% gather('Male','Female', key = 'Gender', value = 'Population')
# Convert Gender from character to factor (levels are ordered alphabetically: "Female", "Male")
new_data1$Gender <- as.factor(new_data1$Gender)
#By Gender
new_data1 %>% group_by(Gender) %>% summarise(
  Mean= mean(Population, na.rm = TRUE),
  Median = median(Population,na.rm = TRUE),
  SD = sd(Population,na.rm = TRUE),
  Q1=quantile(Population, c(0.25)),
  Q3=quantile(Population, c(0.75)),
  IQR = IQR(Population,na.rm = TRUE),
  Min = min(Population,na.rm = TRUE),
  Max = max(Population,na.rm = TRUE),
  Missing = sum(is.na(Population)))-> table1
knitr::kable(table1)%>%  kable_styling(bootstrap_options = "condensed",position='center',font_size =20 )

Gender	Mean	Median	SD	Q1	Q3	IQR	Min	Max	Missing
Female	3266.331	3293.736	70.12735	3227.324	3317.959	90.63450	3152.918	3326.378	0
Male	3195.124	3223.070	70.14408	3156.794	3244.626	87.83125	3081.062	3257.027	0

Data preparation for Visualization

#data for line plot
line_data <-new_data %>% gather('Male','Female', key = 'Gender', value = 'Population')
#data for bar plot
bar_data<-mutate(new_data,Total = Male+Female)
bar_data <-bar_data %>% gather('Male','Female','Total', key = 'Gender', value = 'Population')
#data for pie plot
pie_data <- line_data %>% filter(Year =='2021')
#box plot
box_data <-new_data %>% gather('Male','Female', key = 'Gender', value = 'Population')

Descriptive Statistics cont.

Data Visualisation

#Plots
ggplot(data=line_data, aes(x=Year, y=Population, group=Gender)) +
  geom_line(aes(color=Gender))+
  geom_point(aes(color=Gender))

From the line graph, the shape of population growth for Male and Female is the same, although the size of each increase or decrease differs slightly. The female population is also about 70,000 greater than the male population throughout the period (note that the population is divided by 1000 in the pre-processing section, so the plotted gap is about 70). Looking at the overall pattern, the population grew steadily from 2016 to 2019 and then decreased slowly from 2020 onwards.
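The pattern read off the line graph can also be checked numerically. The following is a minimal base R sketch, assuming the values are typed in from the table printed above (in thousands); the data frame name `pop` and its derived columns are illustrative and not part of the original analysis.

```r
# Population of Victoria by gender, in thousands (copied from the table above)
pop <- data.frame(
  Year   = 2016:2021,
  Male   = c(3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517),
  Female = c(3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424)
)

# Gap between Female and Male in each year (thousands)
pop$Gap <- pop$Female - pop$Male

# Year-on-year change for each gender (first year has no previous value)
pop$MaleChange   <- c(NA, diff(pop$Male))
pop$FemaleChange <- c(NA, diff(pop$Female))

print(pop)
```

Positive changes indicate growth and negative changes indicate decline, so the sign of the last two rows shows where the series turns.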

ggplot(pie_data, aes(x='', y=Population, fill=Gender )) +
  geom_col() +
  geom_label(aes(label = Population), color = "white",
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
        coord_polar(theta = "y")+
  theme_void()

The pie chart shows the population by gender in 2021. Female has a slightly greater proportion than Male, but the difference is not obvious.

Descriptive Statistics cont.

box_data %>% boxplot(Population ~ Gender, data = ., ylab = "Population")

The boxplot shows two notable points. First, there are no outliers, so no further modification of the data is required. Second, Female has a greater population than Male in general.

test_male <-new_data1 %>% filter(Gender == 'Male') 
test_male$Population%>% qqPlot(dist="norm")

## [1] 1 4

test_female<-new_data1 %>% filter(Gender == 'Female') 
test_female$Population%>% qqPlot(dist="norm")

## [1] 1 4

A Q-Q plot is used to check whether the male and female data are normally distributed. Because the dataset is very small, a histogram might not be suitable for checking the distribution, so the Q-Q plot is used instead. In the Q-Q plots, all the data points for both Male and Female fall within the lines, so the data for both groups are considered approximately normally distributed.
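As a supplementary check that is not part of the original analysis, the Shapiro-Wilk test from base R could be run alongside the Q-Q plots. This is only a sketch: the vectors are typed in from the table above, and with six observations per group the test has very little power, so it should be read together with the plots rather than instead of them.

```r
# Population in thousands, copied from the table above
male   <- c(3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517)
female <- c(3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw_male   <- shapiro.test(male)
sw_female <- shapiro.test(female)

# A p-value above the chosen significance level means normality is not rejected
print(sw_male$p.value)
print(sw_female$p.value)
```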

Hypothesis Testing

For the hypothesis testing, a two-tailed, two-sample t-test is used to test whether there is a significant difference in population between Male and Female. To begin with, the assumptions have to be checked. The assumption of normality was checked in the previous section using Q-Q plots, and the data are approximately normally distributed. The assumption of equal variances is checked with Levene’s test. From the result of Levene’s test shown below, the p-value, 0.98, is greater than 0.05, so the assumption of equal variances is not rejected.

leveneTest(Population ~ Gender, data = new_data1)-> table1
knitr::kable(table1)%>%  kable_styling(bootstrap_options = "condensed",position='center')

	Df	F value	Pr(>F)
group	1	0.0006255	0.9805383
	10	NA	NA
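To make the Levene result easier to interpret, the statistic can be rebuilt by hand. The sketch below assumes the median-centred form of the test (the default of `leveneTest` in the car package): a one-way ANOVA on the absolute deviations of each observation from its group median. The data are typed in from the table above, and the names `pop`, `AbsDev`, `f_value` and `p_value` are illustrative.

```r
# Population in thousands, copied from the table above
pop <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 6)),
  Population = c(
    3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517,  # Male
    3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424   # Female
  )
)

# Absolute deviation of each observation from its own group's median
group_median <- ave(pop$Population, pop$Gender, FUN = median)
pop$AbsDev <- abs(pop$Population - group_median)

# One-way ANOVA on the absolute deviations gives the Levene F statistic
fit <- anova(lm(AbsDev ~ Gender, data = pop))
f_value <- fit[["F value"]][1]
p_value <- fit[["Pr(>F)"]][1]

print(c(F = f_value, p = p_value))
```

The printed values should be close to the F value and Pr(>F) in the table above, since both groups have almost identical spread around their medians.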

After the assumptions are checked, the statistical hypotheses are stated:
H0: μΔ = 0 (there is no difference between the mean Male and Female populations)
HA: μΔ ≠ 0 (there is a difference between the mean Male and Female populations)

Equivalently, in terms of the group means, where μ1 is the mean Female population and μ2 is the mean Male population: \[H_0 : \mu_1 = \mu_2 \] \[H_A: \mu_1 \ne \mu_2 \]

Hypothesis Testing cont.

The significance level is set to \[ \alpha = 0.05 \]
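For reference, the statistic behind a two-sample t-test with equal variances can be written out as below. This is the standard textbook form rather than something stated in the original report; the subscripts 1 and 2 are assumed to denote the Female and Male groups, with sample means, sample standard deviations and sample sizes written in the usual way.

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]

With six yearly observations per group, the degrees of freedom are \[ n_1 + n_2 - 2 = 6 + 6 - 2 = 10 \] and H0 is rejected at the 5% level when \[ |t| > t_{0.975,\,10} \approx 2.228 \]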

result <- t.test(
Population ~ Gender,
data = new_data1,
var.equal = TRUE,
alternative = "two.sided"
)
result

## 
##  Two Sample t-test
## 
## data:  Population by Gender
## t = 1.7585, df = 10, p-value = 0.1092
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -19.01608 161.43141
## sample estimates:
## mean in group Female   mean in group Male 
##             3266.331             3195.124

# Displaying the value of t statistics:
result$statistic

##        t 
## 1.758523

# The critical value (lower tail; by symmetry the upper critical value is the same number with a positive sign):
qt(p = 0.05/2 , df = 10 )

## [1] -2.228139

# Displaying p-value:
result$p.value

## [1] 0.1091644

# Displaying confidence interval:
result$conf.int

## [1] -19.01608 161.43141
## attr(,"conf.level")
## [1] 0.95

The two-sample t-test shows that the p-value, 0.1092, is greater than α = 0.05, and the 95% CI of the difference between means, [-19.016, 161.431], captures the H0 value of 0. Therefore, the decision is to fail to reject H0: μ1 = μ2.
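The figures reported by `t.test` can be reproduced step by step in base R, which makes it clearer where the t statistic, p-value and confidence interval come from. This is a sketch under the assumption that the values are typed in from the table above (in thousands); all variable names are illustrative.

```r
# Population in thousands, copied from the table above
male   <- c(3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517)
female <- c(3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424)

n1 <- length(female)
n2 <- length(male)
dof <- n1 + n2 - 2                       # degrees of freedom

# Difference in sample means (Female minus Male, as in the t.test output)
mean_diff <- mean(female) - mean(male)

# Pooled variance and standard error of the difference
sp2 <- ((n1 - 1) * var(female) + (n2 - 1) * var(male)) / dof
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))

t_stat  <- mean_diff / se                # t statistic
p_value <- 2 * pt(-abs(t_stat), dof)     # two-sided p-value
t_crit  <- qt(1 - 0.05 / 2, dof)         # upper critical value at alpha = 0.05
ci      <- mean_diff + c(-1, 1) * t_crit * se  # 95% confidence interval

print(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]))
```

Because the absolute t statistic is below the critical value and the interval contains 0, the manual calculation leads to the same decision as the `t.test` output.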

Discussion

In this report, the population patterns of Male and Female appear very similar. This can be observed in the statistics and visualisations section, where the plots show that the pattern and size of the Male and Female populations are alike. The hypothesis test is consistent with this: H0 was not rejected (p-value 0.1092), so there is no statistically significant difference between the mean Male and Female populations.
A limitation of the t-test is the number of groups that can be tested at one time. If about ten groups needed to be compared, the t-test would not allow this, because it is limited to two groups.
The result of this investigation does not by itself show whether segments within the same segmentation, such as gender or age group, affect each other. There might also be relations across different segmentations; for example, the population by gender might be related to the population by income group.
To conclude the report, the pattern and size of the male and female populations are alike, and the t-test found no statistically significant difference between their means. The test compares means only, so it does not establish whether the male and female populations affect each other.

References

[1] Baglin, J., 2022. Module 7 - Testing the Null: Data on Trial. [online] Applied Analytics Course Website. Available at: https://astral-theory-157510.appspot.com/secured/MATH1324_Module_07.html [Accessed 12 October 2022].

[2] Sustainable Population Australia, 2022. Australia’s Population. [online] Sustainable Population Australia. Available at: https://population.org.au/about-population/australias-population/ [Accessed 1 October 2022].

[3] Australian Bureau of Statistics, 2022. National, state and territory population. [online] Australian Bureau of Statistics. Available at: https://www.abs.gov.au/statistics/people/population/national-state-and-territory-population/mar-2022/ [Accessed 4 October 2022].

[4] Xie, Y., Allaire, J.J., Grolemund, G., 2022. R Markdown: The Definitive Guide. [online] bookdown.org. Available at: https://bookdown.org/yihui/rmarkdown/slidy-presentation.html [Accessed 16 October 2022].

Assignment 2

Statistical analysis of Population in Victoria
