This report covers a statistical investigation and a series of business-related questions, based on a population dataset from the Australian Bureau of Statistics (ABS).
We will use a range of data analysis techniques such as descriptive statistics, data manipulation, and plotting in R, to accurately interpret and communicate the findings.
Setting up the knitr options for the report.
Installing and loading the packages you need to produce the report here:
#, message = FALSE
# Here are some example packages. You may add others if required
# install.packages("here")
# install.packages("readr")
# install.packages("knitr")
# install.packages("ggplot2")
# install.packages("dplyr")
# install.packages("magrittr")
# install.packages("plyr")
# install.packages("tidyr")
library(here) # For sensible file paths in the project folder
library(plyr)
library(readr) # For reading datasets in other formats
library(dplyr) # Useful for data manipulation
library(ggplot2) # Useful for building data visualisations
library(knitr) # Useful for creating nice tables
library(magrittr) # For pipes
library(tidyr)
The dataset is based on the population demographic data of different regions in Australia broken down by gender and age sourced from the Australian Bureau of Statistics (ABS) 2016 Census (population age has been truncated to end at 55 years).
auspop <- read.csv(
here::here("/Users/samuelklettnavarro/PY4E/MATH2406_codes/data_MATH2406",
"pop_dataset_0002.csv"))
head(auspop)
Let’s have a look at the structure of this dataset.
str(auspop)
## 'data.frame': 56000 obs. of 4 variables:
## $ region : chr "SSC21184" "SSC21184" "SSC21184" "SSC21184" ...
## $ age : int 0 0 1 1 2 2 3 3 4 4 ...
## $ gender : chr "M" "F" "M" "F" ...
## $ population: int 114 95 88 107 122 120 123 125 114 117 ...
We can see that there are not obvious missing values, as all the variables contain the same number for entries.
Looking at the variable region, before we transform it into a factor.
regions <- unique(auspop$region)
str(regions)
## chr [1:500] "SSC21184" "SSC20858" "SSC20503" "SSC20477" "SSC20330" ...
We can see that it has 500 values.
auspop$region %<>% factor(.,
levels = regions)
str(auspop$region)
## Factor w/ 500 levels "SSC21184","SSC20858",..: 1 1 1 1 1 1 1 1 1 1 ...
And now looking into factoring the genre column.
genres <- unique(auspop$gender)
str(genres)
## chr [1:2] "M" "F"
There are only two values, so we can look at transforming it into a factor, to facilitate the statistical analysis later on.
auspop$gender %<>% factor(.,
levels = genres)
str(auspop)
## 'data.frame': 56000 obs. of 4 variables:
## $ region : Factor w/ 500 levels "SSC21184","SSC20858",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 0 0 1 1 2 2 3 3 4 4 ...
## $ gender : Factor w/ 2 levels "M","F": 1 2 1 2 1 2 1 2 1 2 ...
## $ population: int 114 95 88 107 122 120 123 125 114 117 ...
Scanning the set for missing values.
colSums(is.na(auspop))
## region age gender population
## 0 0 0 0
Now we have the data ready for analysis.
We need to perform the following data analysis, by looking at some descriptive statistics on the complete dataset:
[Hint: Make sure you consider the population size at each age.]
First we select the data we need for this operation.
agepop <- auspop %>% select(-1, -3)
#head(agepop)
Now we can aggregate the values by grouping by age and then calculating the mean.
agepopg <- agepop %>%
group_by(age) %>%
dplyr::summarise(tot_pop_ag = sum(population))
#head(agepopg) # check
To calculate the mean, we create a new dataframe with the subset needed and use the handy rep() function, whichs use was contrasted with ref[1]
at1 <- rep(agepopg$age, agepopg$tot_pop_ag)
## expands the data by frequency of age
avg_age <- round(mean(at1), 1)
The mean age of the population is 27.8 .
avg_age_sd <- round(sd(at1), 1)
The standard deviation for the population is 15.8 .
In order to provide the age statistics for each region, we need to group the values by region and age, like we did before.
regpopag <- auspop %>% select(-3)
regpopagg <- regpopag %>%
group_by(region, age) %>%
dplyr::summarise(tot_pop = sum(population))
#head(regpopagg) # check
We can gather all the required values in one calculation, creating a new dataframe with the each one of the statistics as values.
sum_stats2 <- regpopagg %>%
group_by(region) %>%
dplyr::summarise(mean_age = round(mean(rep(age, tot_pop)), 1),
sd_age = round(sd(rep(age, tot_pop)), 1),
min_age = min(age),
first_q = quantile(rep(age, tot_pop), 0.25),
median_age = median(rep(age, tot_pop)),
third_q = quantile(rep(age, tot_pop), 0.75),
max_age = max(age),
iqr_age = IQR(rep(age, tot_pop))
)
head(sum_stats2, 3)
Although there are 500 rows of statistical analysis, so it is difficult to display the results for all of the regions.
Now calculating the overall mean values based on the totals for each region.
sum_stats29 <- sum_stats2 %>%
select(-1) %>%
dplyr::summarise(mean_age = round(mean(mean_age), 1),
sd_age = mean(sd_age),
min_age = min(min_age),
first_q = round(mean(first_q), 0),
median_age = median(median_age),
third_q = round(mean(third_q), 0),
max_age = max(max_age),
iqr_age = round(mean(iqr_age), 0)
)
sum_stats29
The mean age of the population (mean age overall = 27.8 ) is lower when calculated with the overall frequencies, than when using the averages for each region (mean age from regions means = 30.6 ). The standard deviation for the population, as expected from the mean results moving higher, is lower when calculated as the mean of means.
hist(sum_stats2$mean_age,
breaks = 30,
main = "Histogram of the regions' mean ages",
xlab = "Age",
col = "steelblue")
The main two justifications are that it looks symmetric and centered about its mean. If we generate a random normal distribution with the same parameters found in task one, and superimpose the real data over the generated, we can see the distribution follows a the normal curve.
seed <- 17
set.seed(seed)
nbr_x <- 500 # number regions
r_data <- rnorm(nbr_x, avg_age, avg_age_sd)
#head(r_data, 10) # looking at first 10 sampled values
ggplot() + labs(x = "Age",
y = 'Frequency',
title = "Histogram comparison between real and synthetic data") +
geom_histogram(aes(r_data)) +
stat_function(fun = dnorm,
geom = "area",
color = 'white',
fill = "lightgrey",
alpha = 0.2) +
geom_histogram(aes(sum_stats2$mean_age),
color = 'white',
fill = 'steelblue',
alpha = 0.8)
It is interesting to see that, given an expected flatter curve, much alike the generated from random data, due to the large standard deviation; it seems the real data is more to be collapsed in the mean, than the standard deviation would lead us to believe.
Consider the region with the largest population size:
For this let’s find the largest region population wise.
regpop <- regpopagg %>%
select(-2) %>%
group_by(region) %>%
dplyr::summarise(tot_pop = sum(tot_pop))
largest_region <- regpop[regpop$tot_pop == max(regpop$tot_pop), ]
largest_region
SSC22015 is the region with the largest population, in order to compare, we need to find some statistics for the whole group.
#head(regpop)
sum_stats31 <- regpop %>%
dplyr::summarise(mean_pop = mean(tot_pop),
sd_pop = sd(tot_pop),
min_pop = min(tot_pop),
first_q = quantile((tot_pop), 0.25),
median_pop = median(tot_pop),
third_q = quantile(tot_pop, 0.75),
max_pop = max(tot_pop),
iqr_pop = IQR(tot_pop)
)
sum_stats31
From the summary values, we can see that there are many regions with very low numbers of population, which means that in that territory, we should be able to find some regions where the population is clustered, given how far the mean is from the median, and also the small Interquartile Range (IQR). This region would be considered an outlier in the distribution, though it weighs a lot of importance in the set. We can have a look at how many other regions are outside of the IQR.
Displaying the age statistics from the table created in task 2.
SSC22015_age_stats <- sum_stats2[sum_stats2$region == "SSC22015",]
SSC22015_age_stats
For this it would be interesting to generate a table with the values for both, the region and the overall statistics, and long format it for easier plotting.
SSC22015_bind <- sum_stats2[sum_stats2$region == "SSC22015", -1]
sum_stats33b <- rbind(sum_stats29, SSC22015_bind)
sum_stats33b$region <- c("TOTALS", "SSC22015")
# long data for easier plotting
sum_stats33 <- sum_stats33b %>%
select(-3, -7) %>%
pivot_longer(., !region, names_to = "stat", values_to = "value")
sum_stats33b
Plotting the relevant results.
p33 <- ggplot(sum_stats33, aes(color=region, y=value, x=stat)) +
geom_point( aes(x=stat, y=value) )
p33 + theme(legend.position = "right",
axis.text = element_text(size = 10),
axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1),) +
xlab("") +
ylab("Values") +
ggtitle("Graphical statistical comparison")
We can clearly see that the region we are studying is 5 years younger than the total trend, which follows that the quartiles are also lower than the totals; together with the median difference and a similar IQR and standard deviation, it comes to say that this population center holds a younger population than other areas. This follows urban development trends, where younger cohorts tend to cluster around services, like universities or other education centers, and also, with easily accessible work for family members.
First subsetting the adequate table to produce the next plots.
SSC22015 <- auspop[auspop$region == "SSC22015",] %>%
select(-1)
# creating the Male dataframe
SSC22015_m <- SSC22015[SSC22015$gender == "M",]
hist(rep(SSC22015_m$age, SSC22015_m$population),
breaks = 30,
main = "Male population distribution in SSC22015",
xlab = "Age",
ylab = "Population",
col = "steelblue")
Obviously we are plotting a normal distribution.
In a similar manner than before.
SSC22015_f <- SSC22015[SSC22015$gender == "F",]
hist(rep(SSC22015_f$age, SSC22015_f$population),
breaks = 30,
main = "Histogram of the Female age in SSC22015",
xlab = "Age",
col = "red")
For a better and smoother comparison, we are going to use some polynomial fitting, which manually was found to be better at the 7th degree.
p36 = ggplot() +
geom_line(data = SSC22015, aes(x = age, y = population, color = gender)) +
geom_smooth(data = SSC22015_f,
aes(x = age, y = population),
color = "red",
method = "lm", formula = y ~ poly(x, 7), se = FALSE) +
geom_smooth(data = SSC22015_m,
aes(x = age, y = population),
color = "blue",
method = "lm", formula = y ~ poly(x, 7), se = FALSE) +
xlab('Age') +
ylab('Population') +
ggtitle("Male vs Female age distribution comparison") +
scale_color_manual(values=c("blue", "red"),
name="Gender",
labels=c("Male", "Female"))
p36
We can see the age distributions following the expected pattern, where there are higher level of males in the lower age bracket, only to be surpassed by females as the years increase.
Supposedly this has to do with males taken on more risky activities, whether socially or work wise, which result in a higher attrition throughout life, although normally we would see this at a later stage in life.
It is of interest to observe the drop in the level of teens, as if demographically the region had suffered a period of population loss, or perhaps aging, only to encounter a return of families and growth, probably socially and economically.
Also, the drop in elderly people seems to indicate some kind of movement or loss, it would be necessary to check the levels in adjacent regions to be able to figure out if they have been pushed out of the commercial center, which probably has higher prices, into surrounding areas.
Provide an analysis of trends in age against region population:
Q4 involves all regions – not just the one with the largest population size. Each region will have its own ratio of older to younger people. The question requires you to examine whether there is any relationship between a region’s age ratio and its population size
Creating the adequate table for this plot:
regpopagg$young_old <- ifelse(regpopagg$age < 40, "younger", "older")
regpopagg$young <- ifelse(regpopagg$young_old == "younger", 1, 0)
regpopagg$old <- ifelse(regpopagg$young_old == "older", 1, 0)
#regpopagg
If we create the summary statistics we need for the plot.
sum_stats41 <- regpopagg %>%
group_by(region) %>%
dplyr::summarise(young_pop = sum(rep(young, tot_pop)),
old_pop = sum(rep(old, tot_pop)),
total_pop = sum(tot_pop),
o2y_ratio = round(old_pop/young_pop, 3),
young_per = round(young_pop*100/total_pop, 1),
old_per = round(old_pop*100/total_pop, 1)
)
head(sum_stats41, 3)
Now, plotting the relationship with a scatter plot, we can see if there is any trends visually and if there are, perhaps a trend line will be added to aid the visualisation.
p41 <- ggplot(sum_stats41) +
geom_point(aes(x=total_pop, y=o2y_ratio), alpha=0.3) +
ggtitle("Old-to-young ratio vs population size for the regions") +
theme(legend.position="none") +
xlab("Population") + ylab("Old to young ratio")
With this plot, we cannot see a clear linear correlation, though it is clear that most of the regions with older population are small.
We might be able to observe a better relationship using normalisation.
min_max_norm <- function(x, na.rm = FALSE) {
(x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x,
na.rm = na.rm))
}
sum_stats41_norm <- sum_stats41 %>%
select(4:5) %>%
mutate(pop_norm = min_max_norm(total_pop))
#head(sum_stats41_norm, 2)
Plotting with the normalised values.
p43 <- ggplot(sum_stats41_norm) +
geom_point( aes(x=pop_norm, y=o2y_ratio), alpha= 0.2 ) +
xlab("Normalised total population") +
ylab("Ratio") +
ggtitle("Old-to-young ratio vs regional population size")
p433 <- ggExtra::ggMarginal(p43, type = "histogram")
p433
Provide an analysis of trends in the ratio of males to females against region population:
As we did in the previous task, first, we need to create some dummy variables to be able to generate the required statistics.
sum_stats5 <- auspop %>% select(-2)
sum_stats5$Female <- ifelse(sum_stats5$gender == 'F', 1, 0)
sum_stats5$Male <- ifelse(sum_stats5$gender == 'M', 1, 0)
#head(sum_stats5, 3)
Now generating the stats for each region.
sum_stats51 <- sum_stats5 %>%
select(-2) %>%
group_by(region) %>%
dplyr::summarise(F_pop = sum(rep(Female, population)),
M_pop = sum(rep(Male, population)),
total_pop = sum(population),
gender_ratio = round(M_pop/F_pop, 3),
F_per = round(F_pop*100/total_pop, 1),
M_per = round(M_pop*100/total_pop, 1)
)
head(sum_stats51, 3)
As with the age ratio, we are going to try to visualise a trend through a scatter plot, in this case we are going to add histograms corresponding to each axis, to visualise the distributions.
p51 <- ggplot(sum_stats51) +
geom_point(aes(x=total_pop, y=gender_ratio), alpha=0.3) +
ggtitle("Gender ratio (M/F) vs regional population size") +
theme(legend.position="none") +
xlab("Population") + ylab("M/F ratio")
p511 <- ggExtra::ggMarginal(p51, type = "histogram")
p511
It is clear that the regions with higher ratios of Male population fall in the regions with the least population; although there are also a lot of lightly populated regions with a ration below parity.
There seems to be a correlation between parity and region size, in the fact that without it, regions don’t seem to increase in total population. It is noticeable, as the marginal histograms indicate, that most of the population falls within a range of [1.5-0] ratio.
Also, the point at the extreme of the population axis is the region that has been studied previously, showcasing to sit in parity; as expected from the population distributions provided above in 3.6.
It could be said too that the binomial distribution, which this plot represents as a 2-D projection of a 3-D surface, is skewed both to the right and to the top, near the y axis and the parity line between genders.
Imagine you have enough financial resources for launching a new energy drink in any two regions:
Given the figures showed previously in Task 5 and 3 , choosing male or female at that age falls slightly towards the female population, thus our choice.
e_d_mean <- (avg_age + sum_stats29$mean_age) / 2
age_interval <- 5
min_int <- round(e_d_mean - (age_interval)/2, 0)
max_int <- round(e_d_mean + (age_interval)/2, 0)
Knowing that the mean population age is 27.8, when calculated with the overall frequencies, than when using the averages for each region is 30.6, if we take the mean point between the two and use it as the centre of the interval for a 5 year bracket, we have our age group between [27 - 32]. (<- THIS IS WHERE THE ERROR WAS, I REALISED I HAD PUT THESE VALUES BEFORE THE CALCULATION)
In order to have the maximum captive market for launch, we would choose the regions with the maximum female population within the age range selected.
sum_stats6 <- auspop %>%
filter(gender == "F") %>%
filter(age >= min_int) %>%
filter(age <= max_int)
sum_stats62 <- sum_stats6 %>%
select(-2, -3) %>%
group_by(region) %>%
dplyr::summarise(target_pop = sum(population) )
#sum_stats62
launch_regions <- sum_stats62 %>%
arrange(desc(target_pop))
launch_regions[1:2,]
1- you believe that 15% of your primary target market in the region will attend the launch. Use this assumption to estimate the numbers of the primary target market that you expect to attend in each region.
For the expected attendance.
exp_att <- launch_regions %>%
mutate(exp_att = round(target_pop*15/100,0))
exp_att[1:2,]
2- Also estimate the likelihood that 30% of the primary target market will attend in each region. Explain your reasoning for both estimates. If we assume a Poisson distribution, given that we have discrete units. For region exp_att[1,1], we have.
# Let X denote the number expected attendies
lambda1 <- exp_att[1,3]
x30 <- exp_att[1,3]*2
#reg1 <- ppois(x30, lambda1, lower.tail = FALSE)
#round(reg1, 3)
Somehow, even though the values appear in the environment, this last equation does not work.
[1] Unknown (2014). ‘R computing mean, median, variance from file with frequency distribution’. Viewed May 18, 2022, from Stackoverflow: https://stackoverflow.com/questions/22644481/r-computing-mean-median-variance-from-file-with-frequency-distribution
[2] Marsja, E (2020). ‘How to Create Dummy Variables in R (with Examples)’. Viewed May 21, 2022: https://www.marsja.se/create-dummy-variables-in-r/
4.2 Comment on any trends you see in the data. What could explain such trends?
The desired effect of spreading the points, by normalising of the population variable with a min-max function, was not achieved, so we are not able to gain further insights in the trends. There were attempts made to transform the ratio, to no avail, due to a number of the values being close to, or actually, zero. If we had more data for older segments of the population, we would have been able to showcase more detail in this plot; another option would have been to generate a cut off value in the range between [30-35], for this particular problem to provide better values.
As expressed above, it is clear that the regions with higher ratios of older population are in the very low end of the total population living in that area.
On the other hand, we can see that the ratio for younger cohorts is very similar across regions of sizes ranging from: small to large; though there is an important cluster of thinly populated and balanced regions. The marginal histograms indicate that most of the regions are well below the 1:1 ratio, thus displaying an overall younger population, which matches the values of the mean explained above.
Also, the point at the extreme of the plot is the region that has been studied previously, which visually can be seen not to have the lowest ratio, although seems to sit just below the average; as expected from the population distributions provided above.
It could be said too that the binomial distribution, which this plot represents as a 2-D projection of a 3-D surface, is skewed both to the right and to the top, near the y axis and the 50-50 old to young line.