This project is an introduction to statistical data analysis. Its aim is to learn summary statistics such as the mean, median and standard deviation, and probability models such as the Normal (or Gaussian) distribution, using functions and visualisations in the R language. By examining a real-life dataset from the Australian Bureau of Statistics (ABS), the project evaluates population data by Age, Gender and Regional location in order to predict where a marketing campaign for a new Energy Drink will be most successful.
This analysis project uses the following typical practices of statistical analysis:
Load the ABS data to begin the analysis. The short data snippet below shows the format of the dataset.
The first step is to perform high-level analysis across the entire dataset. This will provide an overall understanding of the total population.
Initially we will examine the mean age, and the standard deviation of age, across all people included in the dataset.
These statistics are achieved through the following important functions in R -
The chart below (Chart 1) shows the distribution of the population size and the basic statistics for the population are presented below the chart.
## [1] "Task 1.1 - Using fmean function - the Mean of the population is - 27.8"
## [1] "Using fmedian function - the Median of the population is - 28"
## [1] "Tasks 1.2 - The Standard Deviation of the population is - 16.31"
## [1] "The Summary calculations, grouped by Age - "
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 13.75 27.50 27.50 41.25 55.00
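The statistics above can be sketched in base R as follows. This is a minimal illustration only: the column names (`region`, `age`, `gender`, `population`) and the toy values are assumptions about the data layout, not the actual ABS file, and base R functions are used here in place of the `fmean` and `fmedian` functions named in the output.

```r
# Hypothetical layout: one row per region/age/gender cell with a head count.
# Column names and values are assumptions made for illustration.
abs_data <- data.frame(
  region     = c("R1", "R1", "R1", "R2", "R2", "R2"),
  age        = c(10, 30, 50, 10, 30, 50),
  gender     = c("F", "M", "F", "M", "F", "M"),
  population = c(20, 50, 30, 40, 40, 20)
)

w <- abs_data$population
x <- abs_data$age

# Population-weighted mean: each age counts once per person, not once per row.
pop_mean <- weighted.mean(x, w)

# Population-weighted standard deviation (dividing by the total head count).
pop_sd <- sqrt(sum(w * (x - pop_mean)^2) / sum(w))

# Weighted median: repeat each age by its head count, then take the median.
pop_median <- median(rep(x, times = w))

print(c(mean = pop_mean, median = pop_median, sd = pop_sd))
```

Weighting matters here. A plain `summary()` of the distinct age values ignores how many people are at each age, which may be why the summary table reports 27.50 while the weighted mean is 27.8.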
Following the total population age analysis, the next step is to evaluate the population by region. The eventual goal of the region analysis is to identify the two regions most favourable for launching the new energy drink.
An important aspect of the region analysis is to confirm whether the population is normally distributed across regions. The normal (or Gaussian) distribution is one of the most commonly occurring distributions in population data, and provides a level of predictability and consistency when performing analysis. If the distribution isn’t normal we would need to find reasons for the dataset displaying alternate patterns; for example, there could be errors in the data collection or analysis process.
At this stage of the analysis we can also produce a range of statistics across the region grouping, extending the mean, median and standard deviation. The interquartile range provides a measure of variability within the data set, and assists in identifying outliers or skewed results. We can also look at Minimum and Maximum values within the region group, to immediately identify regions that can be excluded from our search for launch regions.
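A per-region summary of this kind might be computed as in the sketch below. The data frame, its column names and its values are hypothetical stand-ins for the ABS data, and the approach (expanding each age by its head count before summarising) is one simple option rather than the method used in this report.

```r
# Hypothetical data: head counts per region and age (names are assumptions).
abs_data <- data.frame(
  region     = rep(c("R1", "R2", "R3"), each = 4),
  age        = rep(c(5, 20, 35, 50), times = 3),
  population = c(10, 30, 40, 20,   40, 30, 20, 10,   0, 0, 5, 15)
)

# Summarise each region after expanding ages to one value per person.
region_stats <- do.call(rbind, lapply(split(abs_data, abs_data$region), function(d) {
  ages <- rep(d$age, times = d$population)
  data.frame(
    region = d$region[1],
    n      = length(ages),
    mean   = mean(ages),
    median = median(ages),
    sd     = sd(ages),   # sample standard deviation (n - 1 denominator)
    iqr    = IQR(ages),  # interquartile range: Q3 minus Q1
    min    = min(ages),
    max    = max(ages)
  )
}))

print(region_stats)
```

In this toy data, region R3 has a minimum age of 35, so it could be excluded immediately if the target market were younger than that.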
This analysis is achieved through the following important functions in R -
The table below shows a snippet of the summary statistics for each Region.
As we can see below in Chart 2, the majority of mean Age values for each region align with the overall mean age. When we order the means from lowest to highest, we see that some regions have lower mean ages and some have higher mean ages, as demonstrated by the curves at either end of the region list. The line representing the overall population mean is well aligned with the majority of the region values. Chart 2 is a Gaussian quantile plot; the Gaussian quantile function is the inverse of the cumulative distribution function (Zoonekynd V. 2007).
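The idea behind a Gaussian quantile plot can be sketched with base R. The region means below are simulated purely for illustration (the spread of 4 years is an assumption), so the numbers do not correspond to Chart 2.

```r
set.seed(1)

# Hypothetical per-region mean ages, simulated for illustration only.
region_means <- rnorm(200, mean = 27.8, sd = 4)

# qnorm() is the quantile function, i.e. the inverse of the CDF pnorm().
p <- pnorm(1.5)
stopifnot(isTRUE(all.equal(qnorm(p), 1.5)))

# Theoretical (x) versus sample (y) quantiles, computed without drawing.
qq <- qqnorm(region_means, plot.it = FALSE)

# Points lying close to a straight line suggest approximate normality.
qq_cor <- cor(qq$x, qq$y)
print(round(qq_cor, 3))

# To draw the plot with a reference line:
# qqnorm(region_means); qqline(region_means)
```

Curvature at either end of such a plot corresponds to tails that are heavier or lighter than a normal distribution would produce.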
When we evaluate a histogram view as represented in Chart 3, we can see an approximately normal distribution, with the majority of values grouping around the mean line and tapering on either side. There is a noticeable skew to the right, denoting a number of regions with an older population, and visually we can see the bulk of the values between 26 and 32 years of age.
The other important aspect of the visual display of the normal distribution is that the mean coincides with the peak of the curve. We can see that the mean line (27.8) crosses the value with the highest count within the histogram.
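The claim that a normal curve peaks at its mean can be checked numerically. In this sketch the mean of 27.8 is taken from the report, while the standard deviation of 3 is an assumed value chosen only for illustration.

```r
mu    <- 27.8  # overall mean age reported above
sigma <- 3     # assumed spread of the region means (illustrative only)

# Locate the peak of the normal density numerically over the age range.
peak <- optimize(function(a) dnorm(a, mean = mu, sd = sigma),
                 interval = c(0, 55), maximum = TRUE)
print(round(peak$maximum, 2))

# The density is also symmetric about the mean.
left  <- dnorm(mu - 5, mean = mu, sd = sigma)
right <- dnorm(mu + 5, mean = mu, sd = sigma)
print(isTRUE(all.equal(left, right)))
```

A histogram with a visible right skew will not satisfy the symmetry property exactly, which is one reason to describe the observed distribution as approximately normal.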
This section evaluates the region with the highest population to examine any similarities or variations that may exist to the overall population.
We also begin to examine differences between genders, again looking at similarities or variations that may exist between the gender populations.
If the statistics for this region are comparable to those of the overall population, we can be more confident in the data set and in our understanding of the groups we will need to target in order to launch our new energy drink.
This section utilises visualisations to convey each group and the data sets are filtered by Region and Gender in order to create subsets of the overall data.
## [1] "Task 3.1 - The Region with the highest population is SSC22015 with 37948 people."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 13.75 27.50 27.50 41.25 55.00
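Finding the most populous region and splitting it by gender can be sketched as below. The region codes, column names and counts are hypothetical and are not taken from the ABS data.

```r
# Hypothetical data (column names and values are assumptions).
abs_data <- data.frame(
  region     = c("R1", "R1", "R2", "R2", "R3", "R3"),
  gender     = c("F", "M", "F", "M", "F", "M"),
  age        = c(25, 25, 25, 25, 25, 25),
  population = c(120, 110, 300, 280, 90, 95)
)

# Total head count per region, then the name of the largest one.
region_totals <- tapply(abs_data$population, abs_data$region, sum)
top_region    <- names(which.max(region_totals))

# Filter to that region and split into one subset per gender.
top_data  <- subset(abs_data, region == top_region)
by_gender <- split(top_data, top_data$gender)

print(top_region)
print(sapply(by_gender, function(d) sum(d$population)))
```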
As per the table below, 26% of the overall population are considered ‘Old’ and 74% are considered ‘Young’. This gives a ratio of Old to Young people of 26:74 (or 13:37): for every 13 old people there are 37 young people. In order to evaluate this in more detail we can break down the ratios into particular visualisations.
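The reduction of 26:74 to 13:37 follows from dividing both sides by their greatest common divisor. A small sketch, where `gcd` is a helper defined here for illustration:

```r
# Euclid's algorithm for the greatest common divisor.
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

old_pct   <- 26
young_pct <- 74

g     <- gcd(old_pct, young_pct)
ratio <- c(old = old_pct / g, young = young_pct / g)
print(ratio)
```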
Referring to Chart 8, we can see the high Young to Old age ratio (74% ‘Young’ population) when comparing age < 40 (Young) to age > 40 (Old). This pie chart presents a quick and simple way to visualise that ratio split.
Examining the Regional Age ratio data further, in Chart 9 we can see a distinct difference in population ratios across regions, which also demonstrates that on average the Young population is higher than the Old. Within Chart 9 we can also see the number of Regions that have only Young or only Old populations, visible on the far right as the ratio of 1. There are just over 30 regions with only an old population and around 35 regions with only a young population. The particular trends we can examine here are -
In Chart 10 we have removed the regions with only young or only old populations, treating these as outliers in the analysis. Given the skewed results within each age group observed in Chart 9, we’ve also calculated the median of the age ratios, aiming for a more accurate indication of the population ranges. As we can see, the median ratios (minus ‘outliers’) are slightly different, at 36% old population and 64% young population. Again, a significant number of regions sit around the 50% ratio for both Young and Old populations.
By adding a Loess (LOcally WEighted Scatter-plot Smoother) curve to each group we can also see the prominent differences between these population groups.
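The outlier removal and median calculation described above can be sketched as follows. The data are hypothetical, and the cut-off used here (age below 40 is ‘Young’, 40 and over is ‘Old’) is an assumption about how the boundary age is treated.

```r
# Hypothetical data: two age cells per region (names and values assumed).
abs_data <- data.frame(
  region     = rep(c("R1", "R2", "R3", "R4"), each = 2),
  age        = rep(c(25, 60), times = 4),
  population = c(60, 40,   80, 20,   50, 0,   0, 30)
)

# Assumed rule: under 40 is Young, otherwise Old.
abs_data$group <- ifelse(abs_data$age < 40, "Young", "Old")

# Head counts per region and age group, then the Old share per region.
counts    <- xtabs(population ~ region + group, data = abs_data)
old_ratio <- counts[, "Old"] / rowSums(counts)

# Drop regions that are entirely Young (ratio 0) or entirely Old (ratio 1).
mixed <- old_ratio[old_ratio > 0 & old_ratio < 1]

median_old   <- median(mixed)
median_young <- 1 - median_old
print(c(old = median_old, young = median_young))
```

The median is used rather than the mean because a handful of regions with extreme ratios would otherwise pull the average away from the typical region.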
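A Loess curve can be fitted with the base R `loess` function. The sketch below uses simulated values; the variable names and the relationship between them are invented for illustration and do not reproduce Chart 10.

```r
set.seed(42)

# Hypothetical inputs: region population size and share of young residents.
pop_size    <- sort(runif(150, min = 50, max = 5000))
young_share <- 0.6 + 0.1 * sin(pop_size / 800) + rnorm(150, sd = 0.05)

# Fit a locally weighted regression; span controls how much smoothing is applied.
fit      <- loess(young_share ~ pop_size, span = 0.75)
smoothed <- predict(fit)

print(head(round(smoothed, 3)))

# To overlay the curve on a scatter plot:
# plot(pop_size, young_share); lines(pop_size, smoothed, col = "red")
```

A smaller `span` follows local wiggles more closely, while a larger one gives a flatter curve, so the choice affects how prominent any difference between groups appears.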
Once again consider all regions: 5.1 For each region, calculate the ratio of males to females. 5.2 Plot the ratio of each region against its population size. 5.3 Comment on any trends you see in the data. What could explain such trends?
As per the table below, we can see that the split between Female and Male is approximately 50:50. In terms of overall population count there are 4,995 more females than males. Relative to the overall population (n = 796,015), this additional female population equates to only 0.006 (or 0.6%) of the total, which is not enough to favour one gender over another.
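The arithmetic behind these figures can be reproduced from the two reported numbers alone (the total population and the female excess); the individual gender counts below are derived from them rather than read from the data.

```r
total  <- 796015  # overall population reported above
excess <- 4995    # females minus males reported above

# Solve females + males = total and females - males = excess.
females <- (total + excess) / 2
males   <- (total - excess) / 2

female_share <- females / total
excess_share <- excess / total

print(c(females = females, males = males))
print(round(c(female_share = female_share, excess_share = excess_share), 3))
```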
As per the table below,
Imagine you have enough financial resources for launching a new energy drink in any two regions: 6.1 Select a gender and age group which spans 3 to 5 years. This will be the primary target market for your hypothetical energy drink. 6.2 Which two regions would you choose? Explain your reasoning. 6.3 In planning each region’s campaign launch, you believe that 15% of your primary target market in the region will attend the launch. Use this assumption to estimate the number of the primary target market that you expect to attend in each region. Also estimate the likelihood that at least 30% of the primary target market will attend in each region. Explain your reasoning for both estimates.
For the selection of the age range, we use ages around the population mean. Given the mean is 27.8, we will use an age range on either side of the mean. This results in a lower age of 27 and an upper age of 29.
For the selection of gender, we have already derived that the population is basically a 50/50 split. We could reach a slightly higher proportion of the population by targeting females; despite this, a small amount of research reveals that males are more likely to consume energy drinks: “Males were significantly more likely than females to be weekly energy drink consumers and to have consumed at least one energy drink.” (Cancer Council, 2018). Other sources also support this trend in the United States (Statista, 2016).
## [1] "Selected Age Range - Lower Age Range = 27 Upper Age Range = 29"
## [1] "Highest population gender = F"
## [1] "Region 1 = SSC22015, Region 2 = SSC21143"
## [1] "Population 1 = 1056, Population 2 = 969"
## [1] "Total Population = 2025"
## [1] "15% of the population attending the target launch = 303.75"
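The attendance estimates can be sketched with a binomial model. This assumes each person in the target market attends independently with probability 0.15, which is a modelling assumption rather than something established by the data; the region populations are those printed above.

```r
p_attend <- 0.15
target   <- c(SSC22015 = 1056, SSC21143 = 969)

# Expected attendance per region: n * p.
expected <- target * p_attend
print(expected)
print(sum(expected))

# Probability that at least 30% attend: P(X >= ceiling(0.30 * n)),
# with X ~ Binomial(n, 0.15). lower.tail = FALSE gives P(X > q), so use q - 1.
threshold <- ceiling(0.30 * target)
p_exact   <- pbinom(threshold - 1, size = target, prob = p_attend,
                    lower.tail = FALSE)

# Normal approximation with a continuity correction, for comparison.
sigma    <- sqrt(target * p_attend * (1 - p_attend))
p_normal <- pnorm(threshold - 0.5, mean = expected, sd = sigma,
                  lower.tail = FALSE)

print(p_exact)
print(p_normal)
```

Under this model the 30% threshold sits more than ten standard deviations above the expected attendance in both regions, so the probability of reaching it is effectively zero.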
[R] pnorm how to decide lower-tail true or false Robert A LaBudde 2007, accessed 2 September 2023, https://stat.ethz.ch/pipermail/r-help/2007-June/133748.html
https://statisticsglobe.com/add-mean-and-median-to-histogram-in-r
http://zoonek2.free.fr/UNIX/48_R/07.html
https://www.statsdirect.com/help/nonparametric_methods/loess.htm#:~:text=Menu%20location%3A%20Analysis_LOESS.,WEighted%20Scatter%2Dplot%20Smoother).
Energy drink consumption and Sleep in Australian Secondary School Students https://www.cancer.org.au/assets/pdf/energy-drink-consumption-and-sleep-in-australian-secondary-school-students#:~:text=In%202018%2C%207%25%20of%20Australian,at%20least%20one%20energy%20drink.
More men than women drink sugar sweetened drinks https://www.abs.gov.au/articles/more-men-women-drinking-sugar-sweetened-drinks
Energy drink consumption frequency in the United States 2016, by gender. https://www.statista.com/statistics/623443/energy-drink-consumption-frequency-in-the-us-by-gender/