Overview

This report covers a statistical investigation and a series of business-related questions, based on a population dataset from the Australian Bureau of Statistics (ABS).

We will use a range of data analysis techniques such as descriptive statistics, data manipulation, and plotting in R, to accurately interpret and communicate the findings.

Setup

Setting up the knitr options for the report.

Installing and loading the packages you need to produce the report here:

#, message = FALSE

# Here are some example packages. You may add others if required
# install.packages("here")
# install.packages("readr")
# install.packages("knitr")
# install.packages("ggplot2")
# install.packages("dplyr")
# install.packages("magrittr")
# install.packages("plyr")
# install.packages("tidyr")

library(here)     # For sensible file paths in the project folder
library(plyr) 
library(readr)    # For reading datasets in other formats 
library(dplyr)    # Useful for data manipulation
library(ggplot2)  # Useful for building data visualisations
library(knitr)    # Useful for creating nice tables
library(magrittr) # For pipes
library(tidyr)

Loading the data

The dataset is based on the population demographic data of different regions in Australia broken down by gender and age sourced from the Australian Bureau of Statistics (ABS) 2016 Census (population age has been truncated to end at 55 years).

auspop <- read.csv(
  here::here("/Users/samuelklettnavarro/PY4E/MATH2406_codes/data_MATH2406", 
       "pop_dataset_0002.csv"))

head(auspop)

Let’s have a look at the structure of this dataset.

str(auspop)
## 'data.frame':    56000 obs. of  4 variables:
##  $ region    : chr  "SSC21184" "SSC21184" "SSC21184" "SSC21184" ...
##  $ age       : int  0 0 1 1 2 2 3 3 4 4 ...
##  $ gender    : chr  "M" "F" "M" "F" ...
##  $ population: int  114 95 88 107 122 120 123 125 114 117 ...

We can see that there are not obvious missing values, as all the variables contain the same number for entries.

Data manipulation

Looking at the variable region, before we transform it into a factor.

regions <- unique(auspop$region)
str(regions)
##  chr [1:500] "SSC21184" "SSC20858" "SSC20503" "SSC20477" "SSC20330" ...

We can see that it has 500 values.

auspop$region %<>% factor(., 
                         levels = regions)
str(auspop$region)
##  Factor w/ 500 levels "SSC21184","SSC20858",..: 1 1 1 1 1 1 1 1 1 1 ...

And now looking into factoring the genre column.

genres <- unique(auspop$gender)
str(genres)
##  chr [1:2] "M" "F"

There are only two values, so we can look at transforming it into a factor, to facilitate the statistical analysis later on.

auspop$gender %<>% factor(., 
                         levels = genres)
str(auspop)
## 'data.frame':    56000 obs. of  4 variables:
##  $ region    : Factor w/ 500 levels "SSC21184","SSC20858",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  0 0 1 1 2 2 3 3 4 4 ...
##  $ gender    : Factor w/ 2 levels "M","F": 1 2 1 2 1 2 1 2 1 2 ...
##  $ population: int  114 95 88 107 122 120 123 125 114 117 ...

Scanning the set for missing values.

colSums(is.na(auspop))
##     region        age     gender population 
##          0          0          0          0

Now we have the data ready for analysis.

Task 1

We need to perform the following data analysis, by looking at some descriptive statistics on the complete dataset:

[Hint: Make sure you consider the population size at each age.]

1.1 Find the mean age of all people included in the dataset.

First we select the data we need for this operation.

agepop <- auspop %>% select(-1, -3)
#head(agepop)

Now we can aggregate the values by grouping by age and then calculating the mean.

agepopg <-  agepop %>%
  group_by(age) %>%
  dplyr::summarise(tot_pop_ag = sum(population))

#head(agepopg) # check

To calculate the mean, we create a new dataframe with the subset needed and use the handy rep() function, whichs use was contrasted with ref[1]

at1 <- rep(agepopg$age, agepopg$tot_pop_ag)  
## expands the data by frequency of age

avg_age <- round(mean(at1), 1)

The mean age of the population is 27.8 .

1.2 Find the standard deviation of all people included in the dataset.

avg_age_sd <- round(sd(at1), 1)

The standard deviation for the population is 15.8 .

Task 2

In the section we need to record the following statistics based on the means of each region:

2.1 mean

2.2 standard deviation

2.3 minimum

2.4 first quartile

2.5 median

2.6 third quartile

2.7 maximum

2.8 interquartile range

In order to provide the age statistics for each region, we need to group the values by region and age, like we did before.

regpopag <- auspop %>% select(-3)

regpopagg <-  regpopag %>%
  group_by(region, age) %>%
  dplyr::summarise(tot_pop = sum(population))

#head(regpopagg) # check

We can gather all the required values in one calculation, creating a new dataframe with the each one of the statistics as values.

sum_stats2 <- regpopagg %>% 
  group_by(region) %>% 
  dplyr::summarise(mean_age = round(mean(rep(age, tot_pop)), 1), 
            sd_age = round(sd(rep(age, tot_pop)), 1),
            min_age = min(age),
            first_q = quantile(rep(age, tot_pop), 0.25),
            median_age = median(rep(age, tot_pop)), 
            third_q = quantile(rep(age, tot_pop), 0.75), 
            max_age = max(age),
            iqr_age = IQR(rep(age, tot_pop))
  )
head(sum_stats2, 3)

Although there are 500 rows of statistical analysis, so it is difficult to display the results for all of the regions.

Now calculating the overall mean values based on the totals for each region.

sum_stats29 <- sum_stats2 %>%
  select(-1) %>% 
  dplyr::summarise(mean_age = round(mean(mean_age), 1),
            sd_age = mean(sd_age),
            min_age = min(min_age),
            first_q = round(mean(first_q), 0),
            median_age = median(median_age), 
            third_q = round(mean(third_q), 0),
            max_age = max(max_age),
            iqr_age = round(mean(iqr_age), 0)
  )

sum_stats29

The mean age of the population (mean age overall = 27.8 ) is lower when calculated with the overall frequencies, than when using the averages for each region (mean age from regions means = 30.6 ). The standard deviation for the population, as expected from the mean results moving higher, is lower when calculated as the mean of means.

2.9 histogram of the distribution of region means

hist(sum_stats2$mean_age,
     breaks = 30,
     main = "Histogram of the regions' mean ages",
     xlab = "Age",
     col = "steelblue")

2.10 Discuss whether the distribution of region means exhibits the characteristic shape of a normal distribution.

The main two justifications are that it looks symmetric and centered about its mean. If we generate a random normal distribution with the same parameters found in task one, and superimpose the real data over the generated, we can see the distribution follows a the normal curve.

seed <- 17 
set.seed(seed)

nbr_x <- 500 # number regions 
r_data <- rnorm(nbr_x, avg_age, avg_age_sd) 
#head(r_data, 10) # looking at first 10 sampled values

ggplot() +   labs(x = "Age", 
       y = 'Frequency', 
       title = "Histogram comparison between real and synthetic data") + 
  geom_histogram(aes(r_data)) + 
  stat_function(fun = dnorm, 
                geom = "area", 
                color = 'white',
                fill = "lightgrey", 
                alpha = 0.2) + 
  geom_histogram(aes(sum_stats2$mean_age), 
                 color = 'white', 
                 fill = 'steelblue', 
                 alpha = 0.8)

It is interesting to see that, given an expected flatter curve, much alike the generated from random data, due to the large standard deviation; it seems the real data is more to be collapsed in the mean, than the standard deviation would lead us to believe.

Task 3

Consider the region with the largest population size:

3.1 Identify the region and describe its population size in comparison with the other regions.

For this let’s find the largest region population wise.

regpop <-  regpopagg %>% 
  select(-2) %>% 
  group_by(region) %>%
  dplyr::summarise(tot_pop = sum(tot_pop))

largest_region <- regpop[regpop$tot_pop == max(regpop$tot_pop), ]
largest_region

SSC22015 is the region with the largest population, in order to compare, we need to find some statistics for the whole group.

#head(regpop)
sum_stats31 <- regpop %>% 
  dplyr::summarise(mean_pop = mean(tot_pop), 
            sd_pop = sd(tot_pop),
            min_pop = min(tot_pop),
            first_q = quantile((tot_pop), 0.25),
            median_pop = median(tot_pop), 
            third_q = quantile(tot_pop, 0.75), 
            max_pop = max(tot_pop),
            iqr_pop = IQR(tot_pop)
  )

sum_stats31

From the summary values, we can see that there are many regions with very low numbers of population, which means that in that territory, we should be able to find some regions where the population is clustered, given how far the mean is from the median, and also the small Interquartile Range (IQR). This region would be considered an outlier in the distribution, though it weighs a lot of importance in the set. We can have a look at how many other regions are outside of the IQR.

3.2 Provide summary statistics for the region.

Displaying the age statistics from the table created in task 2.

SSC22015_age_stats <- sum_stats2[sum_stats2$region == "SSC22015",]
SSC22015_age_stats

3.3 How does the age distribution for the region compare with the distribution of means provided in Q2?

For this it would be interesting to generate a table with the values for both, the region and the overall statistics, and long format it for easier plotting.

SSC22015_bind <- sum_stats2[sum_stats2$region == "SSC22015", -1]

sum_stats33b <- rbind(sum_stats29, SSC22015_bind)

sum_stats33b$region <- c("TOTALS", "SSC22015")
# long data for easier plotting
sum_stats33 <- sum_stats33b %>% 
  select(-3, -7) %>% 
  pivot_longer(., !region, names_to = "stat", values_to =  "value")

sum_stats33b

Plotting the relevant results.

p33 <- ggplot(sum_stats33, aes(color=region, y=value, x=stat)) + 
  geom_point( aes(x=stat, y=value) )
p33 + theme(legend.position = "right", 
            axis.text = element_text(size = 10), 
            axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1),) + 
  xlab("") + 
  ylab("Values") + 
  ggtitle("Graphical statistical comparison")

We can clearly see that the region we are studying is 5 years younger than the total trend, which follows that the quartiles are also lower than the totals; together with the median difference and a similar IQR and standard deviation, it comes to say that this population center holds a younger population than other areas. This follows urban development trends, where younger cohorts tend to cluster around services, like universities or other education centers, and also, with easily accessible work for family members.

3.4 Plot the distribution of age for males in the region.

First subsetting the adequate table to produce the next plots.

SSC22015 <- auspop[auspop$region == "SSC22015",] %>% 
  select(-1)
# creating the Male dataframe
SSC22015_m <- SSC22015[SSC22015$gender == "M",]

hist(rep(SSC22015_m$age, SSC22015_m$population),
     breaks = 30,
     main = "Male population distribution in SSC22015",
     xlab = "Age",
     ylab = "Population",
     col = "steelblue")

Obviously we are plotting a normal distribution.

3.5 Plot the distribution of age for females in the region.

In a similar manner than before.

SSC22015_f <- SSC22015[SSC22015$gender == "F",]

hist(rep(SSC22015_f$age, SSC22015_f$population),
     breaks = 30,
     main = "Histogram of the Female age in SSC22015",
     xlab = "Age",
     col = "red")

3.6 Compare the distributions of Q3.4 and Q3.5 and discuss your findings.

For a better and smoother comparison, we are going to use some polynomial fitting, which manually was found to be better at the 7th degree.

p36 = ggplot() + 
  geom_line(data = SSC22015, aes(x = age, y = population, color = gender)) +
  geom_smooth(data = SSC22015_f, 
              aes(x = age, y = population), 
              color = "red", 
              method = "lm", formula = y ~ poly(x, 7), se = FALSE) + 
  geom_smooth(data = SSC22015_m, 
              aes(x = age, y = population), 
              color = "blue", 
              method = "lm", formula = y ~ poly(x, 7), se = FALSE) +
  xlab('Age') +
  ylab('Population') + 
  ggtitle("Male vs Female age distribution comparison") + 
  scale_color_manual(values=c("blue", "red"), 
                       name="Gender",
                       labels=c("Male", "Female"))
p36

We can see the age distributions following the expected pattern, where there are higher level of males in the lower age bracket, only to be surpassed by females as the years increase.

Supposedly this has to do with males taken on more risky activities, whether socially or work wise, which result in a higher attrition throughout life, although normally we would see this at a later stage in life.

It is of interest to observe the drop in the level of teens, as if demographically the region had suffered a period of population loss, or perhaps aging, only to encounter a return of families and growth, probably socially and economically.

Also, the drop in elderly people seems to indicate some kind of movement or loss, it would be necessary to check the levels in adjacent regions to be able to figure out if they have been pushed out of the commercial center, which probably has higher prices, into surrounding areas.

Task 4

Provide an analysis of trends in age against region population:

Q4 involves all regions – not just the one with the largest population size. Each region will have its own ratio of older to younger people. The question requires you to examine whether there is any relationship between a region’s age ratio and its population size

4.1 Plot the ratio of older to younger people, where ‘younger’ is defined as aged below 40 and ‘older’ as aged 40 and above.

Creating the adequate table for this plot:

regpopagg$young_old <- ifelse(regpopagg$age < 40, "younger", "older")

regpopagg$young <- ifelse(regpopagg$young_old == "younger", 1, 0)
regpopagg$old <- ifelse(regpopagg$young_old == "older", 1, 0)
#regpopagg

If we create the summary statistics we need for the plot.

sum_stats41 <- regpopagg %>% 
  group_by(region) %>% 
  dplyr::summarise(young_pop = sum(rep(young, tot_pop)),
            old_pop = sum(rep(old, tot_pop)),
            total_pop = sum(tot_pop),
            o2y_ratio = round(old_pop/young_pop, 3),
            young_per = round(young_pop*100/total_pop, 1), 
            old_per = round(old_pop*100/total_pop, 1)
  )

head(sum_stats41, 3)

Now, plotting the relationship with a scatter plot, we can see if there is any trends visually and if there are, perhaps a trend line will be added to aid the visualisation.

p41 <-  ggplot(sum_stats41) + 
    geom_point(aes(x=total_pop, y=o2y_ratio), alpha=0.3) + 
    ggtitle("Old-to-young ratio vs population size for the regions") +
    theme(legend.position="none") +
    xlab("Population") + ylab("Old to young ratio")

With this plot, we cannot see a clear linear correlation, though it is clear that most of the regions with older population are small.

We might be able to observe a better relationship using normalisation.

min_max_norm <- function(x, na.rm = FALSE) {
  (x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x,
na.rm = na.rm))
}
sum_stats41_norm <- sum_stats41 %>% 
  select(4:5) %>% 
  mutate(pop_norm = min_max_norm(total_pop))
#head(sum_stats41_norm, 2)

Plotting with the normalised values.

p43 <- ggplot(sum_stats41_norm) + 
  geom_point( aes(x=pop_norm, y=o2y_ratio), alpha= 0.2 ) + 
  xlab("Normalised total population") + 
  ylab("Ratio") + 
  ggtitle("Old-to-young ratio vs regional population size")
p433 <- ggExtra::ggMarginal(p43, type = "histogram")
p433

Task 5

Provide an analysis of trends in the ratio of males to females against region population:

As we did in the previous task, first, we need to create some dummy variables to be able to generate the required statistics.

sum_stats5 <- auspop %>% select(-2)

sum_stats5$Female <- ifelse(sum_stats5$gender == 'F', 1, 0)
sum_stats5$Male <- ifelse(sum_stats5$gender == 'M', 1, 0)
#head(sum_stats5, 3)

Now generating the stats for each region.

sum_stats51 <- sum_stats5 %>% 
  select(-2) %>% 
  group_by(region) %>% 
  dplyr::summarise(F_pop = sum(rep(Female, population)),
            M_pop = sum(rep(Male, population)),
            total_pop = sum(population),
            gender_ratio = round(M_pop/F_pop, 3),
            F_per = round(F_pop*100/total_pop, 1), 
            M_per = round(M_pop*100/total_pop, 1)
  )
head(sum_stats51, 3)

5.1 Plot the ratio against the population by region.

As with the age ratio, we are going to try to visualise a trend through a scatter plot, in this case we are going to add histograms corresponding to each axis, to visualise the distributions.

p51 <-  ggplot(sum_stats51) + 
    geom_point(aes(x=total_pop, y=gender_ratio), alpha=0.3) + 
    ggtitle("Gender ratio (M/F) vs regional population size") +
    theme(legend.position="none") +
    xlab("Population") + ylab("M/F ratio")
p511 <- ggExtra::ggMarginal(p51, type = "histogram")
p511

Task 6

Imagine you have enough financial resources for launching a new energy drink in any two regions:

6.1 Select a gender and age group which spans 3 to 5 years. This will be the primary target market for your hypothetical energy drink.

Given the figures showed previously in Task 5 and 3 , choosing male or female at that age falls slightly towards the female population, thus our choice.

e_d_mean <- (avg_age + sum_stats29$mean_age) / 2
age_interval <- 5
min_int <- round(e_d_mean - (age_interval)/2, 0)
max_int <- round(e_d_mean + (age_interval)/2, 0)

Knowing that the mean population age is 27.8, when calculated with the overall frequencies, than when using the averages for each region is 30.6, if we take the mean point between the two and use it as the centre of the interval for a 5 year bracket, we have our age group between [27 - 32]. (<- THIS IS WHERE THE ERROR WAS, I REALISED I HAD PUT THESE VALUES BEFORE THE CALCULATION)

6.2 Which two regions would you choose? Explain your reasoning.

In order to have the maximum captive market for launch, we would choose the regions with the maximum female population within the age range selected.

sum_stats6 <- auspop %>% 
  filter(gender == "F") %>% 
  filter(age >= min_int) %>% 
  filter(age <= max_int)
sum_stats62 <- sum_stats6 %>% 
  select(-2, -3) %>% 
  group_by(region) %>% 
  dplyr::summarise(target_pop = sum(population) )
#sum_stats62
launch_regions <- sum_stats62 %>% 
  arrange(desc(target_pop))
launch_regions[1:2,]

6.3 In planning each region’s campaign launch

1- you believe that 15% of your primary target market in the region will attend the launch. Use this assumption to estimate the numbers of the primary target market that you expect to attend in each region.

For the expected attendance.

exp_att <- launch_regions %>% 
  mutate(exp_att = round(target_pop*15/100,0))
exp_att[1:2,]

2- Also estimate the likelihood that 30% of the primary target market will attend in each region. Explain your reasoning for both estimates. If we assume a Poisson distribution, given that we have discrete units. For region exp_att[1,1], we have.

# Let X denote the number expected attendies

lambda1 <- exp_att[1,3]   
x30 <- exp_att[1,3]*2  

#reg1 <- ppois(x30, lambda1, lower.tail = FALSE) 
#round(reg1, 3)

Somehow, even though the values appear in the environment, this last equation does not work.

Reference