MATH2406 Applied Analytics

Introduction

This project is an introduction to statistical data analysis. Its aim is to learn descriptive statistics such as the mean, median and standard deviation, and probability models such as the Normal (or Gaussian) distribution, using functions and visualisations in the R language. By examining a real-life dataset from the Australian Bureau of Statistics (ABS), the project evaluates population data by Age, Gender and Region in order to predict where a marketing campaign for a new energy drink will be most successful.

This analysis project uses the following typical practices of statistical analysis:

Setup

Load the ABS data to begin the analysis work. The short data snippet below shows the format of the dataset.

Task 1 - Overall Age analysis

The first step is to perform high-level analysis across the entire dataset. This will provide an overall understanding of the total population.

Initially we will examine the mean age of all people and the standard deviation of all people included in the dataset.

These statistics are achieved through the following important functions in R -

group_by - this is used to group data in a dataframe. For this scenario we will group by ‘Age’ and count the totals for each individual age found in the dataset.
fmean - this is used to calculate the ‘mean’ within the dataset and uses a weighting for each population.
fmedian - this is used to calculate the ‘median’ within the dataset.
ggplot2::geom_bar - this is used to generate a useful visualisation of the grouped data and presents a quick way to better understand the overall distribution of Age within the dataset.
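As a minimal sketch of the weighted statistics described above, the base R code below computes a population-weighted mean, median and standard deviation. The column names (`age`, `population`) and the toy counts are assumptions made for illustration, and base R functions stand in for `fmean` and `fmedian`; this is not the code used to produce the report's figures.

```r
# Toy data: hypothetical columns, one row per age with a head count.
pop <- data.frame(
  age        = c(10, 20, 30, 40, 50),
  population = c(200, 300, 260, 150, 100)
)

# Weighted mean: each age contributes in proportion to its head count.
w_mean <- weighted.mean(pop$age, w = pop$population)

# Weighted (population) standard deviation around the weighted mean.
w_sd <- sqrt(sum(pop$population * (pop$age - w_mean)^2) / sum(pop$population))

# Weighted median: repeat each age by its count, then take the median.
w_median <- median(rep(pop$age, times = pop$population))

print(round(c(mean = w_mean, median = w_median, sd = w_sd), 2))

# Unweighted summary of the distinct ages 0 to 55, for comparison.
age_summary <- summary(0:55)
print(age_summary)
```

An unweighted `summary()` of the distinct ages 0 to 55 has a mean and median of 27.5 and quartiles of 13.75 and 41.25. The report's "grouped by Age" summary shows the same figures, which is consistent with it describing the age values themselves rather than the population-weighted distribution.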

The chart below (Chart 1) shows the distribution of the population size and the basic statistics for the population are presented below the chart.

## [1] "Task 1.1 - Using fmean function - the Mean of the population is -  27.8"

## [1] "Using fmedian function - the Median of the population is -  28"

## [1] "Tasks 1.2 - The Standard Deviation of the population is -  16.31"

## [1] "The Summary calculations, grouped by Age - "

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   13.75   27.50   27.50   41.25   55.00

Task 2 - Analysis by Region

Following on from the total population age analysis, the next step is to evaluate the population by region. The eventual goal of the region analysis is to identify 2 regions that will be the most favourable to launch the new energy drink.

An important aspect of the region analysis is to confirm whether the population is normally distributed across regions. The normal (or Gaussian) distribution is one of the most commonly occurring distributions in population data, and provides a level of predictability and consistency when performing analysis. If the distribution isn’t normal we would need to find out why the dataset displays a different pattern; for example, there could be errors in the data collection or analysis process.

At this stage of the analysis we can also produce a range of statistics for each region, extending the mean, median and standard deviation. The interquartile range provides a measure of variability within the data set, and assists in identifying outliers or skewed results. We can also look at the Minimum and Maximum values within each region, to immediately identify regions that can be excluded from our search for launch regions.

This analysis is achieved through the following important functions in R -

min / max - these functions provide the minimum and maximum value within a range.
quantile - this is used to find the 25th and 75th percentiles (the first and third quartiles). The interquartile range between them covers the middle 50% of the values, and this holds whether or not the data is normally distributed.
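The functions listed above can be combined into a per-region summary along the following lines. This is a sketch on toy data: the column names `region` and `age` and the values are assumptions for illustration, not the ABS data.

```r
# Toy data: hypothetical region codes and ages.
ages <- data.frame(
  region = rep(c("A", "B"), each = 5),
  age    = c(18, 22, 27, 31, 45, 25, 26, 28, 30, 33)
)

# Build one row of summary statistics per region.
region_stats <- do.call(rbind, lapply(split(ages$age, ages$region), function(x) {
  q <- quantile(x, probs = c(0.25, 0.75))  # first and third quartiles
  data.frame(
    min = min(x), q1 = q[[1]], median = median(x), mean = mean(x),
    q3 = q[[2]], max = max(x), iqr = IQR(x), sd = sd(x)
  )
}))

print(region_stats)
```

In this toy example region A has the wider interquartile range (9 against 4), which is the kind of difference the summary table is meant to surface.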

The table below shows a snippet of the summary statistics for each Region.

Task 2.1 - Summary statistics for Region Mean

Task 2.2 - Normal Distribution of Region

As we can see below in Chart 2, the mean Age values for the majority of regions align with the overall average age. When we order the means from lowest to highest, we see that some regions have lower mean ages and some have higher mean ages, as shown by the curves at either end of the region list. The line representing the overall population mean is well aligned with the majority of the region values. Chart 2 is a Gaussian quantile (Q-Q) plot, which is based on the inverse of the cumulative distribution function (Zoonekynd V. 2007).

When we evaluate a histogram view as represented in Chart 3, we can see an approximately normal distribution, with the majority of values grouped around the mean line and tapering on either side. There is a prominent skew to the right, denoting a tail of regions with older populations, and visually we can see the bulk of the values between 26 and 32 years of age.

The other important aspect of the visual display of the normal distribution is that the mean sits at the maximum point of the curve. We can see that the mean line (27.8) crosses the value with the highest count within the histogram.
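The visual checks described in this section can be sketched in base R as follows. The data here is simulated and only stands in for the per-region mean ages; the mean of 27.8 comes from the report, while the spread and sample size are assumptions. The Shapiro-Wilk test is added as a numeric complement and is not part of the original analysis.

```r
set.seed(42)  # reproducible toy data

# Hypothetical stand-in for the per-region mean ages (not the ABS values).
region_means <- rnorm(500, mean = 27.8, sd = 4)

# Q-Q plot: sample quantiles against theoretical normal quantiles.
# Points close to the reference line suggest an approximately normal shape.
qq <- qqnorm(region_means, main = "Normal Q-Q plot of region mean age")
qqline(region_means, col = "red")

# Correlation between the two sets of quantiles; values near 1 indicate
# a nearly straight Q-Q plot.
qq_cor <- cor(qq$x, qq$y)

# Histogram with the mean marked, similar in spirit to Chart 3.
hist(region_means, breaks = 30, main = "Region mean age", xlab = "Mean age")
abline(v = mean(region_means), col = "blue", lwd = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(region_means)
print(c(qq_correlation = qq_cor, shapiro_p = sw$p.value))
```

With real region data, curvature at the ends of the Q-Q plot (as described for Chart 2) would show up as a lower correlation and usually a smaller Shapiro-Wilk p-value.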

Task 3 - Highest population region

This section evaluates the region with the highest population to examine any similarities or variations that may exist to the overall population.

We also begin to examine differences between gender, also looking at similarities or variations that may exist between gender populations.

If the statistics across this region are comparable, then we are closer to being more confident in the data set and understanding the groups we will need to target in order to launch our new energy drink.

This section utilises visualisations to convey each group and the data sets are filtered by Region and Gender in order to create subsets of the overall data.
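A minimal sketch of the filtering step is shown below, assuming hypothetical column names (`region`, `gender`, `age`, `population`) and toy values rather than the ABS data.

```r
# Toy data: hypothetical columns and values.
abs_toy <- data.frame(
  region     = c("R1", "R1", "R2", "R2", "R3", "R3"),
  gender     = c("M", "F", "M", "F", "M", "F"),
  age        = c(25, 25, 25, 25, 25, 25),
  population = c(40, 45, 120, 130, 60, 55)
)

# Total population per region, then the region with the largest total.
totals     <- aggregate(population ~ region, data = abs_toy, FUN = sum)
top_region <- as.character(totals$region[which.max(totals$population)])

# Subsets of the top region, split by gender.
top_males   <- abs_toy[abs_toy$region == top_region & abs_toy$gender == "M", ]
top_females <- abs_toy[abs_toy$region == top_region & abs_toy$gender == "F", ]

print(top_region)
print(max(totals$population))
```

The two gender subsets can then be passed to the same summary and plotting steps used for the overall population.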

Task 3.1 - Highest population region statistics

## [1] "Task 3.1 - The Region with the highest population is  SSC22015  with  37948  people."

Task 3.2 - The summary statistics for Age in this region

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   13.75   27.50   27.50   41.25   55.00

Task 3.3 - Compare the overall population with the highest region population

Task 3.4 - Distribution of Age for Males in the highest population region

Task 3.5 - Distribution of Age for Females in the highest population region

Task 3.6 - Comparison of Female and Male populations for the highest region

Task 4

Task 4.1 - Age ratio - older to younger people (‘younger’ is below 40 years and ‘older’ is 40 years and above)

As per the table below, 26% of the overall population are considered ‘Old’ and 74% of the population are considered ‘Young’. This presents a ratio of Old to Young people of 26:74 or (13:37) - for every 13 old people there are 37 young people. In order to evaluate this in more detail we can break down the ratios into particular visualisations.
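The reduction of 26:74 to 13:37 can be checked with a short calculation. The sketch below also shows one way to label ages using the Task 4.1 definition; the `gcd` helper and the sample ages are defined here for illustration only.

```r
# Labelling rule from Task 4.1: 'Old' is 40 and above, 'Young' is below 40.
ages      <- c(12, 39, 40, 55)  # hypothetical sample ages
age_group <- ifelse(ages >= 40, "Old", "Young")

# Shares reported in the text, in whole percentage points.
old_pct   <- 26
young_pct <- 74

# Greatest common divisor via Euclid's algorithm (helper defined here).
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

# Divide both sides by the common divisor to get the simplest ratio.
g     <- gcd(old_pct, young_pct)
ratio <- c(old = old_pct / g, young = young_pct / g)
print(ratio)
```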

Task 4.2 - Plot the ratio of each region against its population size.

Referring to Chart 8, we can see the high Young to Old age ratio (74% ‘Young’ population) when comparing ages below 40 (Young) with ages 40 and above (Old). This pie chart presents a quick and simple way to visualise that ratio split.

Task 4.3 - Trends in the Age ratio data

Examining the Regional Age ratio data further, in chart 9 we can see a distinct difference in population ratios across regions, which also demonstrates that on average the Young population is larger than the Old. Within chart 9 we can also see the number of Regions that have only Young or only Old populations, visible on the far right as the ratio of 1. There are just over 30 regions with only an old population and similarly around 35 regions with only a young population. The particular trends we can examine here are -

1. The old population ratio is weighted toward the left of the chart, that is, toward lower population ratios, while the opposite is apparent in the young population.
2. There are a significant number of regions with around 50% old and young people, denoted by the spike between the two skewed charts.

In chart 10 we have removed the regions with only young or only old populations, treating these as outliers in the analysis. Given the skewed results within each age group observed in chart 9, we’ve also calculated the Median of the age ratios, aiming for a more accurate indication of the population ranges. As we can see, the median ratios (minus ‘outliers’) are slightly different, at 36% old population and 64% young population. Again, a significant number of regions sit around the 50% ratio for both Young and Old populations within this chart.
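The outlier removal and median step can be sketched as follows. The vector of old-age shares is toy data chosen for illustration, not the regional values from the ABS dataset.

```r
# Toy data: hypothetical share of 'Old' people (age 40 and above) per region.
old_share <- c(0, 0.20, 0.30, 0.36, 0.45, 0.50, 1, 1)

# Drop regions that are entirely young (share 0) or entirely old (share 1).
mixed <- old_share[old_share > 0 & old_share < 1]

# The median is less sensitive to skewed values than the mean.
median_old <- median(mixed)

# The young share is 1 minus the old share, so its median follows directly.
median_young <- 1 - median_old

print(c(old = median_old, young = median_young))
```

Because each region's young share is one minus its old share, the two medians always sum to 1, which matches the 36% and 64% pair reported above.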

By adding a Loess (LOcally WEighted Scatter-plot Smoother) curve to each group we can also see the prominent differences between these population groups.
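As an illustration of the smoothing technique, the sketch below fits a LOESS curve with base R's `loess()` on simulated data. The variables, the span of 0.75 and the choice of plotting share against region population are assumptions; the report's own charts may be built differently (for example with ggplot2).

```r
set.seed(1)  # reproducible toy data

# Hypothetical region sizes and old-age shares (not the ABS values).
population <- round(runif(200, min = 50, max = 5000))
old_share  <- pmin(pmax(0.26 + rnorm(200, sd = 0.08), 0), 1)

# Fit a LOESS curve of old-age share against region population.
fit <- loess(old_share ~ population, span = 0.75)

# Evaluate the smooth on an ordered grid so it can be drawn as a line.
grid <- data.frame(
  population = seq(min(population), max(population), length.out = 100)
)
smooth <- predict(fit, newdata = grid)

# Scatter plot of the raw points with the smoothed curve on top.
plot(population, old_share, pch = 16, col = "grey60",
     xlab = "Region population", ylab = "Share aged 40 and over")
lines(grid$population, smooth, col = "blue", lwd = 2)
```

Fitting one curve per group (Young and Old) and drawing both on the same axes makes the separation between the groups easier to see than the raw points alone.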

Task 5

Once again consider all regions: 5.1 For each region, calculate the ratio of males to females. 5.2 Plot the ratio of each region against its population size. 5.3 Comment on any trends you see in the data. What could explain such trends?

Task 5.1 - Gender ratio - Males to Females

As per the table below, the split between Female and Male is roughly 50:50. In terms of overall population count there are 4,995 more females than males. Relative to the overall population (n = 796,015), this additional female population equates to only 0.006 (or 0.6%) of the total, which is not large enough to favour one gender over the other.
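The figures in this paragraph can be reproduced from the two numbers reported in the text, the overall population and the female surplus. The sketch below derives the implied female and male counts; the variable names are chosen for illustration.

```r
total      <- 796015  # overall population reported in the text
difference <- 4995    # females minus males, reported in the text

# Solve females + males = total and females - males = difference.
females <- (total + difference) / 2
males   <- (total - difference) / 2

# Male-to-female ratio and the surplus as a share of the total.
male_to_female   <- males / females
difference_share <- difference / total

print(c(females = females, males = males))
print(round(c(ratio = male_to_female, surplus_share = difference_share), 4))
```

The surplus share of about 0.006 agrees with the 0.6% quoted above, and the male-to-female ratio is a little under 1.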

Task 5.2 - Gender ratio - Males to Females

As per the table below,

Task 6

Imagine you have enough financial resources for launching a new energy drink in any two regions: 6.1 Select a gender and age group which spans 3 to 5 years. This will be the primary target market for your hypothetical energy drink. 6.2 Which two regions would you choose? Explain your reasoning. 6.3 In planning each region’s campaign launch, you believe that 15% of your primary target market in the region will attend the launch. Use this assumption to estimate the number of the primary target market that you expect to attend in each region. Also estimate the likelihood that at least 30% of the primary target market will attend in each region. Explain your reasoning for both estimates.

Task 6.1 - Selection of Age range and Gender

For the selection of age range, we use ages around the population mean. Given the mean is 27.8, we will use an age range on either side of the mean. This results in a lower age of 27 and an upper age of 29.

For the selection of gender, we have already established that the population is essentially a 50/50 split, with females making up a slightly higher proportion. Despite this, a small amount of research indicates that males are more likely to consume energy drinks: “Males were significantly more likely than females to be weekly energy drink consumers and to have consumed at least one energy drink.” (Cancer Council, 2018). Other sources show a similar pattern in the United States (Statista, 2016).

## [1] "Selected Age Range - Lower Age Range = 27 Upper Age Range = 29"

## [1] "Highest population gender = F"

Task 6.2 - Selection of Regions

## [1] "Region 1 = SSC22015, Region 2 = SSC21143"

## [1] "Population 1 = 1056, Population 2 = 969"

Task 6.3 - Probability of 30% attending

## [1] "Total Population = 2025"

## [1] "15% of the population attending the target launch = 303.75"
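One way to approach both Task 6.3 estimates is sketched below. It assumes each person in the target market attends independently with probability 0.15, so attendance in a region follows a Binomial(n, p) distribution; that modelling choice is an assumption made here, not something the output above confirms. The target-market sizes are taken from the Task 6.2 output.

```r
# Target-market sizes per region, from the Task 6.2 output.
target   <- c(SSC22015 = 1056, SSC21143 = 969)
p_attend <- 0.15  # assumed attendance rate from the task description

# Expected attendance per region under Binomial(n, p) is n * p.
expected <- target * p_attend

# Smallest head count that is at least 30% of each target market.
threshold <- ceiling(0.30 * target)

# Exact upper tail: P(X >= threshold) = P(X > threshold - 1).
p_exact <- pbinom(threshold - 1, size = target, prob = p_attend,
                  lower.tail = FALSE)

# Normal approximation with a continuity correction, using pnorm.
mu       <- target * p_attend
sigma    <- sqrt(target * p_attend * (1 - p_attend))
p_normal <- pnorm(threshold - 0.5, mean = mu, sd = sigma, lower.tail = FALSE)

print(expected)
print(p_exact)
print(p_normal)
```

The expected attendances sum to the 303.75 shown above. Under these assumptions, 30% attendance lies more than ten standard deviations above the expected value in each region, so both tail probabilities are effectively zero. Setting `lower.tail = FALSE` returns the upper-tail probability directly instead of computing one minus the lower tail.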

References

[R] pnorm how to decide lower-tail true or false Robert A LaBudde 2007, accessed 2 September 2023, https://stat.ethz.ch/pipermail/r-help/2007-June/133748.html

https://statisticsglobe.com/add-mean-and-median-to-histogram-in-r

http://zoonek2.free.fr/UNIX/48_R/07.html

https://www.statsdirect.com/help/nonparametric_methods/loess.htm#:~:text=Menu%20location%3A%20Analysis_LOESS.,WEighted%20Scatter%2Dplot%20Smoother).

Energy drink consumption and Sleep in Australian Secondary School Students https://www.cancer.org.au/assets/pdf/energy-drink-consumption-and-sleep-in-australian-secondary-school-students#:~:text=In%202018%2C%207%25%20of%20Australian,at%20least%20one%20energy%20drink.

More men than women drink sugar sweetened drinks https://www.abs.gov.au/articles/more-men-women-drinking-sugar-sweetened-drinks

Energy drink consumption frequency in the United States 2016, by gender. https://www.statista.com/statistics/623443/energy-drink-consumption-frequency-in-the-us-by-gender/
