Lu Teck Hii s3939509
Last updated: 16 October, 2022
RPubs link: https://rpubs.com/Luke97/A2-math1234
According to Sustainable Population Australia (2022), the population of Australia has been increasing for the last 20 years. As Victoria is one of the most populous states in Australia, its population would be expected to have grown as well.
The population can be segmented in many ways. Do the populations of different segments affect each other? If so, how have the populations of the different segments in Victoria been related in recent years? To investigate, gender is selected as one demographic segment. This report examines the population of Victoria by gender from 2016 to 2021. The dataset comprises the population of Australia by gender and state; however, this report concentrates on Victoria as a proof of concept.
This report focuses on the pattern of the population by gender in Victoria. Summary statistics, visualisations, and hypothesis testing are used to answer these questions: the statistics and visualisations build an understanding of the data, and hypothesis testing examines whether the Male and Female populations differ significantly.
Columns of data (after pre-processing):
- Male : This column contains the male population of Victoria (in thousands).
- Female : This column contains the female population of Victoria (in thousands).
- Year : This column contains the year of each population figure.
# Pre-processing
# Packages used throughout the report
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(kableExtra)
library(car)
# To run the code, download the spreadsheet from the source and save it as data.xlsx
data <- read_excel("data.xlsx", sheet = "Table_5") # Read data from the spreadsheet
# Selected required data only
male<-data[c(13:18),1:10]
female<-data[c(38:43),1:10]
# Update the column Name
colname <- c(data[4,2:10])
new_colname <-c('NSW','VIC','QLD','SA','WA','TAS','NT','ACT','Total')
colnames(male) <- c("Year",new_colname)
colnames(female) <- c("Year",new_colname)
# omit na
female<-na.omit(female)
male<-na.omit(male)
# convert to suitable data type
index <- c(2:10)
female[index] <- sapply(female[index],as.numeric)
male[index] <- sapply(male[index],as.numeric)
# filtering- select required columns
male_VIC <- male %>% select(Year,VIC)
female_VIC <- female %>% select(Year,VIC)
# rename column
colnames(male_VIC) <- c('Year','Male')
colnames(female_VIC) <- c('Year','Female')
#merge dataset
new_data <- full_join(male_VIC,female_VIC,by='Year')
# Divide the population by 1000 so that plot labels and legends stay readable
# (this avoids scientific notation such as 1e+01)
new_data <- new_data %>% mutate(Male = Male/1000, Female = Female/1000)
new_data -> table1
knitr::kable(table1) %>% kable_styling(bootstrap_options = "condensed", position = 'center')

| Year | Male | Female |
|---|---|---|
| 2016 | 3081.062 | 3152.918 |
| 2017 | 3141.518 | 3212.749 |
| 2018 | 3202.623 | 3271.049 |
| 2019 | 3257.027 | 3326.378 |
| 2020 | 3244.995 | 3318.470 |
| 2021 | 3243.517 | 3316.424 |
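To follow the later steps without downloading the spreadsheet, the table above can be typed in directly. This is only a sketch: the values are copied from the table (population in thousands), and storing `Year` as a character column is an assumption based on the later `filter(Year == '2021')` call.

```r
# Rebuild the pre-processed table by hand (values copied from the table above).
# Assumption: Year is kept as character, matching the later filter(Year == '2021').
new_data <- data.frame(
  Year   = as.character(2016:2021),
  Male   = c(3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517),
  Female = c(3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424),
  stringsAsFactors = FALSE
)

# Quick structural check before moving on
str(new_data)
colMeans(new_data[, c("Male", "Female")])
```

The column means printed at the end can be compared with the Mean column of the summary table in the next section.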
The data are summarised by gender. The statistics for Male and Female are close: the Female values are about 70 (thousand) higher throughout, and the spread is almost identical (SD of about 70.1 for both), so the population is fairly balanced by gender. There are no missing values in the data set.
new_data1<-new_data %>% gather('Male','Female', key = 'Gender', value = 'Population')
# Change Gender from character to factor, with the level order stated explicitly
new_data1$Gender <- factor(new_data1$Gender, levels = c("Female", "Male"))
#By Gender
new_data1 %>% group_by(Gender) %>% summarise(
Mean= mean(Population, na.rm = TRUE),
Median = median(Population,na.rm = TRUE),
SD = sd(Population,na.rm = TRUE),
Q1=quantile(Population, c(0.25)),
Q3=quantile(Population, c(0.75)),
IQR = IQR(Population,na.rm = TRUE),
Min = min(Population,na.rm = TRUE),
Max = max(Population,na.rm = TRUE),
Missing = sum(is.na(Population)))-> table1
knitr::kable(table1) %>% kable_styling(bootstrap_options = "condensed", position = 'center', font_size = 20)

| Gender | Mean | Median | SD | Q1 | Q3 | IQR | Min | Max | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Female | 3266.331 | 3293.736 | 70.12735 | 3227.324 | 3317.959 | 90.63450 | 3152.918 | 3326.378 | 0 |
| Male | 3195.124 | 3223.070 | 70.14408 | 3156.794 | 3244.626 | 87.83125 | 3081.062 | 3257.027 | 0 |
#data for line plot
line_data <-new_data %>% gather('Male','Female', key = 'Gender', value = 'Population')
#data for bar plot
bar_data<-mutate(new_data,Total = Male+Female)
bar_data <-bar_data %>% gather('Male','Female','Total', key = 'Gender', value = 'Population')
#data for pie plot
pie_data <- line_data %>% filter(Year =='2021')
#data for box plot
box_data <- new_data %>% gather('Male','Female', key = 'Gender', value = 'Population')
#Plots
ggplot(data=line_data, aes(x=Year, y=Population, group=Gender)) +
geom_line(aes(color=Gender))+
geom_point(aes(color=Gender))
From the line graph, the population curves for Male and Female have the same shape, although the size of each yearly increase or decrease differs slightly. The female population is also roughly 70,000 greater than the male population throughout the period (note that the population was divided by 1000 in the pre-processing section). Looking at the overall pattern, the population grew steadily from 2016 to 2019 and then decreased slowly from 2020 onwards.
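The gap between the two lines and the year-on-year changes can be read off numerically as well. The sketch below uses only the values from the data table (in thousands); the variable names are illustrative and not part of the original code.

```r
# Population in thousands, copied from the data table
years  <- 2016:2021
male   <- c(3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517)
female <- c(3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424)

# Female minus Male for each year (thousands of people)
gap <- female - male
print(data.frame(Year = years, Gap = round(gap, 3)))

# Year-on-year change for each gender; negative values mean a decrease
change <- data.frame(
  Year   = years[-1],
  Male   = diff(male),
  Female = diff(female)
)
print(change)
```

Positive entries in `change` correspond to the rising part of the line plot and negative entries to the falling part.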
ggplot(pie_data, aes(x='', y=Population, fill=Gender )) +
geom_col() +
geom_label(aes(label = Population), color = "white",
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
coord_polar(theta = "y")+
theme_void()
The pie chart shows the population by gender in 2021. Female has a slightly greater share than Male, but the difference is not obvious.
# Descriptive Statistics cont.
The boxplot shows two notable things. First, there are no outliers, so no further modification of the data is required. Second, the Female population is greater than the Male population in general.
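The "no outlier" reading of the boxplot can be checked with the usual 1.5 × IQR rule. This is a minimal sketch: `find_outliers` is a hypothetical helper written for this illustration, and it assumes the boxplot used the default quartile-based fences.

```r
# Population in thousands, copied from the data table
male   <- c(3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517)
female <- c(3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424)

# Hypothetical helper: return values outside Q1 - 1.5*IQR or Q3 + 1.5*IQR
find_outliers <- function(x) {
  q     <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * IQR(x)
  x[x < q[1] - fence | x > q[2] + fence]
}

outliers_male   <- find_outliers(male)
outliers_female <- find_outliers(female)

# An empty result means no point lies beyond the whiskers
print(outliers_male)
print(outliers_female)
```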
## [1] 1 4
## [1] 1 4
Q-Q plots are used to check whether the male and female data are normally distributed. Because the dataset is very small, a histogram might not be suitable for checking the distribution, so Q-Q plots are used instead. In the Q-Q plots, all the data points for both Male and Female lie within the confidence bands, so the data for both groups are considered normally distributed.
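The plotting code for this step is not shown in the report. The sketch below is one way to draw comparable plots with base R's `qqnorm()` and `qqline()`; it is not necessarily the function the author used, and it draws a reference line rather than confidence bands. A Shapiro-Wilk test is added as a numeric complement, which is a different technique from the visual check described above.

```r
# Population in thousands, copied from the data table
male   <- c(3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517)
female <- c(3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424)

# Two panels side by side: sample quantiles against theoretical normal quantiles
old_par <- par(mfrow = c(1, 2))
qq_male <- qqnorm(male, main = "Male")
qqline(male)
qq_female <- qqnorm(female, main = "Female")
qqline(female)
par(old_par)

# Numeric complement (not part of the original analysis): Shapiro-Wilk test.
# A p-value above 0.05 gives no evidence against normality.
sw_male   <- shapiro.test(male)
sw_female <- shapiro.test(female)
print(sw_male)
print(sw_female)
```

With only six observations per group, both the plot and the test have little power, so they can only flag strong departures from normality.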
For the hypothesis test, a two-tailed, two-sample t-test was used to test whether there is a significant difference in population between Male and Female. First, the assumptions have to be checked. Normality was checked in the previous section with Q-Q plots, and the data appear normally distributed. The assumption of equal variances can be checked with Levene's test. In the result shown below, the p-value of 0.98 is greater than 0.05, so the assumption of equal variances is not rejected.
leveneTest(Population ~ Gender, data = new_data1)-> table1
knitr::kable(table1) %>% kable_styling(bootstrap_options = "condensed", position = 'center')

| | Df | F value | Pr(>F) |
|---|---|---|---|
| group | 1 | 0.0006255 | 0.9805383 |
| | 10 | NA | NA |
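To see what Levene's test is doing, it can be reproduced in base R as a one-way ANOVA on the absolute deviations from each group's median. The sketch assumes the median-centred variant, which is the default of `car::leveneTest`; the variable names are illustrative.

```r
# Population in thousands, copied from the data table
male   <- c(3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517)
female <- c(3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424)

pop    <- c(female, male)
gender <- factor(rep(c("Female", "Male"), each = 6))

# Absolute deviation of every observation from its own group median
abs_dev <- abs(pop - ave(pop, gender, FUN = median))

# Levene's test is a one-way ANOVA on those deviations
levene <- anova(lm(abs_dev ~ gender))
print(levene)

f_value <- levene[["F value"]][1]
p_value <- levene[["Pr(>F)"]][1]
```

A small F value here means the two groups spread around their medians by about the same amount, which is what the equal-variance assumption requires.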
Once the assumptions are checked, the statistical hypotheses are stated, where μΔ is the difference between the mean Female and mean Male populations:
H0: μΔ = 0 (the mean Male and Female populations do not differ)
HA: μΔ ≠ 0 (the mean Male and Female populations differ)
The significance level is denoted α: \[ α = 0.05 \]
result <- t.test(
Population ~ Gender,
data = new_data1,
var.equal = TRUE,
alternative = "two.sided"
)
result
##
## Two Sample t-test
##
## data: Population by Gender
## t = 1.7585, df = 10, p-value = 0.1092
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -19.01608 161.43141
## sample estimates:
## mean in group Female mean in group Male
## 3266.331 3195.124
## t
## 1.758523
## [1] -2.228139
## [1] 0.1091644
## [1] -19.01608 161.43141
## attr(,"conf.level")
## [1] 0.95
The two-sample t-test gives a p-value of 0.1092, which is greater than α = 0.05, and the 95% CI for the difference between means, [-19.016, 161.431], contains 0, the value specified by H0. Therefore, the decision is to fail to reject H0: μ1 = μ2.
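The numbers in the output can be traced back to the pooled-variance formulas. The following is a sketch that recomputes the test statistic, the critical value, the p-value and the confidence interval by hand from the values in the data table; the variable names are illustrative.

```r
# Population in thousands, copied from the data table
male   <- c(3081.062, 3141.518, 3202.623, 3257.027, 3244.995, 3243.517)
female <- c(3152.918, 3212.749, 3271.049, 3326.378, 3318.470, 3316.424)

n1 <- length(female)
n2 <- length(male)
df <- n1 + n2 - 2

# Pooled variance and standard error of the difference in means
sp2 <- ((n1 - 1) * var(female) + (n2 - 1) * var(male)) / df
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))

mean_diff <- mean(female) - mean(male)
t_stat    <- mean_diff / se

# Two-sided p-value and the critical value at alpha = 0.05
p_value <- 2 * pt(-abs(t_stat), df)
t_crit  <- qt(0.975, df)

# 95% confidence interval for the difference in means
ci <- mean_diff + c(-1, 1) * t_crit * se

print(c(t = t_stat, df = df, p = p_value))
print(ci)
```

Because `abs(t_stat)` is smaller than `t_crit`, the interval straddles zero, which is the same decision as comparing the p-value with α.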
In this report, the population patterns of Male and Female appear very similar, as seen in the statistics and visualisations sections: the plots show that the pattern and size of the two populations are alike. In the hypothesis test, H0 was not rejected, so there is no statistically significant difference between the mean Male and Female populations. The test compares means only and provides no evidence that one population affects the other.
A limitation of the t-test is the number of groups that can be compared at one time. The full dataset contains about ten groups that could be tested, but a t-test compares at most two samples at once.
From this investigation, there may be no relationship between populations within the same segmentation, such as gender or age group. However, there might be relationships across different segmentations; for example, the population by gender might be related to the population by income group.
To conclude, although the pattern and size of the male and female populations are alike, this analysis found no evidence of a relationship in which the male and female populations affect each other.
[1] Baglin, J., 2022. Module 7 - Testing the Null: Data on Trial. [online] Applied Analytics Course Website. Available at: https://astral-theory-157510.appspot.com/secured/MATH1324_Module_07.html [Accessed 12 October 2022].
[2] Sustainable Population Australia, 2022. Australia’s Population. [online] Sustainable Population Australia. Available at: https://population.org.au/about-population/australias-population/ [Accessed 1 October 2022].
[3] Australian Bureau of Statistics, 2022. National, state and territory population. [online] Australian Bureau of Statistics. Available at: https://www.abs.gov.au/statistics/people/population/national-state-and-territory-population/mar-2022/ [Accessed 4 October 2022].
[4] Xie, Y., Allaire, J.J., Grolemund, G., 2022. R Markdown: The Definitive Guide. [online] bookdown.org. Available at: https://bookdown.org/yihui/rmarkdown/slidy-presentation.html [Accessed 16 October 2022].