1.1 Print out the dimensions of the data frame.
## [1] 56000 4
1.2 Print out the names and type of each of the data frame’s columns.
## tibble [56,000 x 4] (S3: tbl_df/tbl/data.frame)
## $ region : chr [1:56000] "SSC20005" "SSC20005" "SSC20005" "SSC20005" ...
## $ age : num [1:56000] 0 0 1 1 2 2 3 3 4 4 ...
## $ gender : chr [1:56000] "M" "F" "M" "F" ...
## $ population: num [1:56000] 0 0 0 0 0 0 0 0 0 0 ...
1.3 Print out the number of unique regions in the dataset (500 unique regions, each with 112 observations).
## 'data.frame': 500 obs. of 2 variables:
## $ Group.1: chr "SSC20005" "SSC20012" "SSC20018" "SSC20027" ...
## $ x : int 112 112 112 112 112 112 112 112 112 112 ...
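The three outputs above (dimensions, column types, rows per region) can be reproduced along the following lines. This is only a sketch on a small made-up data frame: the name `pop` and the region codes `R1` to `R3` are hypothetical, and the real data has 56,000 rows and 500 regions.

```r
# Hypothetical stand-in for the real data: 3 regions, ages 0-1, two genders.
pop <- data.frame(
  region     = rep(c("R1", "R2", "R3"), each = 4),
  age        = rep(c(0, 0, 1, 1), times = 3),
  gender     = rep(c("M", "F"), times = 6),
  population = c(2, 1, 0, 3, 5, 4, 1, 1, 0, 0, 2, 2),
  stringsAsFactors = FALSE
)

dims <- dim(pop)   # number of rows, number of columns
print(dims)
str(pop)           # column names and types

# Rows per region; aggregate() names its output columns Group.1 and x.
rows_per_region <- aggregate(pop$age, by = list(pop$region), FUN = length)
n_regions <- length(unique(pop$region))
print(n_regions)
str(rows_per_region)
```

The `Group.1` and `x` column names in the output for 1.3 are the defaults that `aggregate()` produces when it is given an unnamed vector and an unnamed grouping list.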
1.4 What is the minimum age bin? Ans: 0 years
1.5 What is the maximum age bin? Ans: 55 years
1.6 What is the bin size for the age field? Ans: 1 year
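As a rough check of answers 1.4 to 1.6, the age bins can be inspected as below. The vector `ages` is a hypothetical stand-in for one region's `age` column, built to match the 112 rows per region reported above (56 ages for each of two genders).

```r
# Hypothetical age column for one region: ages 0..55, one row per gender.
ages <- rep(0:55, each = 2)

min_age  <- min(ages)
max_age  <- max(ages)
# Bin size: the spacing between consecutive distinct ages.
bin_size <- unique(diff(sort(unique(ages))))

print(c(min = min_age, max = max_age, bin = bin_size))
```

If `bin_size` came back with more than one value, the bins would not be evenly spaced.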
2.1 Use the expected value of age to find the mean age for the whole data sample. Ans: The expected value is 27.80.
## [1] 27.80027
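Because each row is an age bin with a head count, the mean age is a population-weighted average, E[age] = sum(age * population) / sum(population). A minimal sketch with made-up numbers (the vectors below are hypothetical, not the assignment data):

```r
# Hypothetical age bins and head counts.
age        <- c(0, 1, 2, 3)
population <- c(10, 20, 30, 40)

# Expected value: each age weighted by the share of people at that age.
expected_age <- sum(age * population) / sum(population)
print(expected_age)

# Base R's weighted.mean() computes the same quantity.
print(weighted.mean(age, w = population))
```

Taking `mean(age)` directly would ignore the head counts and treat every bin as equally populated.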
2.2 Standard Deviation for whole data sample
Ans: The sample standard deviation is 15.778, the same as the population standard deviation (15.778) to 3 d.p.
## [1] 15.77804
## [1] 15.77818
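The two standard deviations differ only in the divisor (N for the population form, N - 1 for the sample form), which is why they agree to 3 d.p. when N is large. A sketch under the assumption that the weights are head counts; the numbers are hypothetical:

```r
# Hypothetical age bins and head counts.
age        <- c(0, 1, 2, 3)
population <- c(10, 20, 30, 40)

N  <- sum(population)
mu <- sum(age * population) / N
ss <- sum(population * (age - mu)^2)   # weighted sum of squared deviations

sd_population <- sqrt(ss / N)          # divide by N
sd_sample     <- sqrt(ss / (N - 1))    # divide by N - 1

print(c(population = sd_population, sample = sd_sample))

# Cross-check: expand the counts into one value per person and use sd().
print(sd(rep(age, times = population)))
```

The ratio of the two results is sqrt(N / (N - 1)), which approaches 1 as N grows.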
Question 3 Statistics of mean age for each region
3.1 Mean=30.608
3.2 SD=7.996
3.3 Minimum = 2
3.4 First Quartile = 27.426
3.5 Median = 29.232
3.6 Third Quartile = 33.35
3.7 Maximum = 55
3.8 IQR = 5.924
## [1] "1"
## [1] 30.608
## [1] 7.9962
## [1] 2
## 25%
## 27.426
## 50%
## 29.232
## 75%
## 33.35
## [1] 55
## [1] 5.9243
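The statistics in 3.1 to 3.8 describe the distribution of one weighted mean age per region. One way this could be computed is sketched below; the data frame `pop` and regions `A`, `B`, `C` are hypothetical.

```r
# Hypothetical data: 3 regions, ages 0..3, head count per age.
pop <- data.frame(
  region     = rep(c("A", "B", "C"), each = 4),
  age        = rep(0:3, times = 3),
  population = c(1, 1, 1, 1,  0, 0, 0, 4,  4, 0, 0, 0)
)

# Population-weighted mean age within each region.
region_means <- sapply(split(pop, pop$region), function(d) {
  sum(d$age * d$population) / sum(d$population)
})
print(region_means)

print(mean(region_means))
print(sd(region_means))
print(min(region_means))
print(quantile(region_means, c(0.25, 0.50, 0.75)))
print(max(region_means))
print(IQR(region_means))
```

A region with zero total population would give `NaN` here (0 / 0), so such regions would need to be dropped or handled before summarising.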
3.9 Histogram of the distribution of means

## $breaks
## [1] 0 5 10 15 20 25 30 35 40 45 50 55
##
## $counts
## [1] 2 8 6 10 36 232 106 45 26 13 16
##
## $density
## [1] 0.0008 0.0032 0.0024 0.0040 0.0144 0.0928 0.0424 0.0180 0.0104 0.0052
## [11] 0.0064
##
## $mids
## [1] 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5
##
## $xname
## [1] "WMS"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
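The listing above is what a base R histogram object looks like when printed. A sketch of how such an object arises, using the same 5-year breaks; the values in `WMS` below are hypothetical (the name only mirrors the `$xname` shown in the output):

```r
# Hypothetical regional mean ages.
WMS <- c(2, 27, 28, 31, 54)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(WMS, breaks = seq(0, 55, by = 5), plot = FALSE)

print(h$counts)    # number of values in each 5-year bin
print(h$density)   # counts / (n * bin width)
print(h$mids)      # bin midpoints
```

Calling `plot(h)` draws the chart. In the output above, the density of the tallest bin is 232 / (500 * 5) = 0.0928, consistent with this definition.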
Question 4 Region with smallest population
SSC20099 is one of the regions with the smallest population (3 people).
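Finding the smallest (and largest) regions amounts to summing `population` per region and then picking the extremes. The sketch below uses hypothetical regions `R1` to `R4` and keeps ties for the minimum, since more than one region can share the smallest total.

```r
# Hypothetical data: total head count per region and gender.
pop <- data.frame(
  region     = rep(c("R1", "R2", "R3", "R4"), each = 2),
  gender     = rep(c("M", "F"), times = 4),
  population = c(1, 2,  10, 12,  2, 1,  40, 35)
)

# Total population per region; the grouping column is named Group.1.
totals <- aggregate(pop["population"], by = list(pop$region), FUN = sum)

# All regions tied for the smallest total, and the single largest region.
smallest <- totals[totals$population == min(totals$population), ]
largest  <- totals[which.max(totals$population), ]

print(smallest)
print(largest)
```

`which.min()` would return only the first of the tied regions, which is why the equality filter is used for the minimum.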
Question 5 Region with largest population
From the plot in 5.1, it is observed that:
1. The population is highest for ages below 5 and around age 30.
2. The population declines between ages 5 and 20, and again beyond age 30.
3. This pattern suggests that the region is populated mainly by young families with young children.
## Group.1 population
## 1 SSC22015 37948

## geom_step: na.rm = FALSE
## stat_ecdf: n = NULL, pad = TRUE, na.rm = FALSE
## position_identity
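The layer description above comes from an ECDF-style step plot. The quantity behind such a plot, the cumulative share of the region's population by age, can be computed directly as sketched below. The data frame `region_rows` is hypothetical and stands in for the rows of the largest region.

```r
# Hypothetical rows for one region: ages 0..4, one row per gender.
region_rows <- data.frame(
  age        = rep(0:4, each = 2),
  gender     = rep(c("M", "F"), times = 5),
  population = c(5, 5,  2, 2,  1, 1,  3, 3,  4, 4)
)

# Head count per age (both genders combined).
by_age <- aggregate(population ~ age, data = region_rows, FUN = sum)

# Cumulative share of the population at or below each age.
by_age$cum_share <- cumsum(by_age$population) / sum(by_age$population)
print(by_age)
```

Plotting `cum_share` against `age` as a step function (for example `plot(by_age$age, by_age$cum_share, type = "s")`) gives the cumulative curve; steep sections correspond to the ages where the population is concentrated.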


6.1 Scatter Plot: Ratio of old to young vs population
## [1] "1"

The scatter plot in 6.1 indicates the following trends:
1. When the population is low, the ratio of old to young is high. This suggests that regions with small populations have more old people than young people.
2. When the population is high, the ratio is low. This shows that the more populous regions have more young people than old people.
3. This is consistent with older people moving to small country towns where housing is cheaper. Many of them are likely retired, and their limited funds give them greater spending power there.
4. Younger couples and individuals tend to live in more populous regions for cheaper housing, jobs, schooling, health facilities and other conveniences that support their lifestyles.
5. Many regions fall between these two extremes, and the trend there depends on a combination of factors, e.g. stage of life, wealth level, empty nesters and the availability of jobs.
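The ratio discussed above can be computed per region as sketched below. The report does not state the age that separates "old" from "young", so the `cutoff` of 40 is purely an assumption for illustration, as are the data frame `pop` and the regions `R1` and `R2`.

```r
# Hypothetical data: a small region skewed old, a large region skewed young.
pop <- data.frame(
  region     = rep(c("R1", "R2"), each = 4),
  age        = rep(c(10, 20, 45, 50), times = 2),
  population = c(1, 1, 3, 3,  40, 40, 10, 10)
)

cutoff <- 40   # ASSUMPTION: ages >= 40 count as "old"; not from the report

ratio_by_region <- do.call(rbind, lapply(split(pop, pop$region), function(d) {
  old   <- sum(d$population[d$age >= cutoff])
  young <- sum(d$population[d$age <  cutoff])
  data.frame(region = d$region[1],
             total = sum(d$population),
             old_to_young = old / young)
}))
print(ratio_by_region)
```

A region with no young residents would give `Inf` (or `NaN` if it is empty), so such rows may need filtering before plotting `old_to_young` against `total`.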
7.1 Scatter Plot: Ratio of female to male vs population
## [1] "1"
## [1] 5.3333
## [1] 0

The scatter plot in 7.1 indicates the following trends:
1. In regions where the population is low, the female-to-male ratio ranges from 0 to a high of 5.33.
2. One possible reason for the high ratios is related to Question 6.2, where retired people tend to move to small country towns. Females generally live longer than males, so more males are likely to have died in these low-population regions.
3. Another possible reason is that male family members who are still fit and healthy enough to work move to more populous regions to find work and send money home. The female members can then stay in a region where the cost of living is lower and rely on the income earned by the male members of the family. This pattern is common in countries such as China, where male family members move from rural villages to large cities to earn an income that supports their families in poorer villages.
4. The red regression line shows that the ratio stays close to 1, reflecting the roughly even balance between females and males in most regions.
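The female-to-male ratio and the fitted line can be sketched as follows. The data frame `pop` and regions `R1` to `R3` are hypothetical, chosen so that the ratios span the same range as the reported minimum (0) and maximum (5.33).

```r
# Hypothetical data: head count per region and gender.
pop <- data.frame(
  region     = rep(c("R1", "R2", "R3"), each = 2),
  gender     = rep(c("F", "M"), times = 3),
  population = c(16, 3,  50, 50,  0, 4)
)

# Region-by-gender table of head counts.
tab <- xtabs(population ~ region + gender, data = pop)

females <- as.vector(tab[, "F"])
males   <- as.vector(tab[, "M"])
ratio   <- setNames(females / males, rownames(tab))
total   <- as.vector(rowSums(tab))

print(round(ratio, 4))
print(c(max = max(ratio), min = min(ratio)))

# Simple linear regression of the ratio on total population.
fit <- lm(ratio ~ total)
print(coef(fit))
```

With only a handful of very small regions producing extreme ratios, a fitted line can sit near 1 even though individual low-population regions vary widely.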
8.1
Females aged 18 to 21 have been chosen as the primary customers for a hypothetical product: a face cream that will make a young woman look even more beautiful. In addition, each purchaser has a chance of referring other customers to buy the product. The top three referrers will earn a free dinner for two at a five-star restaurant.
8.2
1. The two regions with the largest population of females between 18 and 21 are SSC22015 (1113) and SSC20492 (2566).
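Selecting the target group and ranking regions by its size could look like the sketch below. All data here is hypothetical (`grid`, regions `R1` to `R3`, and the head counts); only the filter condition, females aged 18 to 21 inclusive, comes from the text.

```r
# Hypothetical data: every combination of gender, age and region.
grid <- expand.grid(
  gender = c("F", "M"),
  age    = c(17, 18, 21, 22),
  region = c("R1", "R2", "R3"),
  stringsAsFactors = FALSE
)
grid$population <- 100   # placeholder head count for males
grid$population[grid$gender == "F"] <-
  c(9, 5, 5, 9,   1, 20, 30, 1,   0, 2, 3, 0)

# Keep only females aged 18 to 21 inclusive.
targets <- subset(grid, gender == "F" & age >= 18 & age <= 21)

# Target customers per region, largest first.
by_region <- aggregate(population ~ region, data = targets, FUN = sum)
top2 <- head(by_region[order(-by_region$population), ], 2)
print(top2)
```

Sorting in descending order before taking the first two rows lists the largest region first.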
The four plots in 9.3 show that as n increases, the distribution of the sample means approaches the normal distribution, confirming the Central Limit Theorem (CLT), which states that the distribution of sample means drawn from any distribution with finite variance tends to the normal distribution as the sample size grows. (Math2406, Applied Analytic 3.20)
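The behaviour described above can be illustrated with a small simulation. This sketch does not use the report's data: it substitutes a skewed exponential distribution, and the sample sizes 1, 5, 30 and 100 and the helper names `sample_means` and `skewness` are assumptions chosen for illustration.

```r
set.seed(1)

# Means of `reps` samples of size n from an exponential(1) distribution.
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Simple moment-based skewness; 0 for a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sizes <- c(1, 5, 30, 100)
results <- t(sapply(sizes, function(n) {
  m <- sample_means(n)
  c(n = n, mean = mean(m), sd = sd(m), skew = skewness(m))
}))
print(round(results, 3))
```

As n grows, the spread of the sample means shrinks roughly in proportion to 1 / sqrt(n) and the skewness moves toward 0, which is the pattern a sequence of histograms would show visually.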




References
Adding manual legend to ggplot2, viewed 20 Jan 2022 https://community.rstudio.com/t/adding-manual-legend-to-ggplot2/41651
Convert a Numeric Object to Character, viewed 20 Jan 2022 https://www.geeksforgeeks.org/convert-a-numeric-object-to-character-in-r-programming-as-character-function/
Convert Factor to Numeric, viewed 20 Jan 2022 https://www.geeksforgeeks.org/convert-factor-to-numeric-and-numeric-to-factor-in-r-programming/
Filter function, viewed 19 Jan 2022 https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/filter
ggplot2 scatter plots, viewed 23 Jan 2022 http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization
How to Make ECDF Plot with ggplot2, viewed 20 Jan 2022 https://www.geeksforgeeks.org/how-to-make-ecdf-plot-with-ggplot2-in-r/
Make a Histogram, viewed 19 Jan 2022 https://www.datacamp.com/community/tutorials/make-histogram-basic-r
Random number generator, viewed 25 Jan 2022 https://www.educba.com/random-number-generator-in-r/
Repeating rows, viewed 25 Jan 2022 https://stackoverflow.com/questions/8753531/repeat-rows-of-a-data-frame-n-times
RMIT Course Math 2404 Data Visualisation and Communication
RMIT Course Math 2406 Applied Analytic
Select certain rows, viewed 23 Jan 2022 https://stackoverflow.com/questions/2854625/select-only-rows-if-its-value-in-a-particular-column-is-less-than-the-value-in-t
Select values that have specific characters, viewed 22 Jan 2022 https://community.rstudio.com/t/how-to-select-values-that-have-specific-characters/68748
Sorting, viewed 19 Jan 2022 https://www.datacamp.com/community/tutorials/sorting-in-r
6.2 Comment on Trends