Setup

Install and load the packages you need to produce the report here:

# load the packages used in this report

library(dplyr) # Useful for data manipulation
library(ggplot2) # Useful for building data visualisations
library(knitr) # Useful for creating nice tables
library(here) # Useful for building file paths relative to the project folder

Instructions

Follow the instructions given in the Assessment brief to fill out the template below. Remember to include R code and output and, where applicable, plain-text explanations.

Task 1

We will import the data, pop_dataset_0002.csv, and call it data. This will be manipulated throughout the series of steps, but we will import it again as data2 for later manipulation.

# load data
data <- read.csv(here("data", "pop_dataset_0002.csv"))
summary(data)
##     region               age           gender            population    
##  Length:56000       Min.   : 0.00   Length:56000       Min.   :  0.00  
##  Class :character   1st Qu.:13.75   Class :character   1st Qu.:  0.00  
##  Mode  :character   Median :27.50   Mode  :character   Median :  0.00  
##                     Mean   :27.50                      Mean   : 14.21  
##                     3rd Qu.:41.25                      3rd Qu.:  7.00  
##                     Max.   :55.00                      Max.   :726.00
# check for missing values
sum(is.na(data$region))
## [1] 0
sum(is.na(data$age))
## [1] 0
sum(is.na(data$gender))
## [1] 0
sum(is.na(data$population))
## [1] 0
# there are no missing values, but many rows have population = 0
  
# replace any missing values with the column mean (this changes nothing here, because no NAs were found)
data$age[is.na(data$age)] <- mean(data$age, na.rm = TRUE)
data$population[is.na(data$population)] <- mean(data$population, na.rm = TRUE)


#I will remove the rows where the population value = 0
data <- data[data$population != 0,]
summary(data)
##     region               age           gender            population    
##  Length:23121       Min.   : 0.00   Length:23121       Min.   :  3.00  
##  Class :character   1st Qu.:13.00   Class :character   1st Qu.:  4.00  
##  Mode  :character   Median :28.00   Mode  :character   Median : 11.00  
##                     Mean   :28.26                      Mean   : 34.43  
##                     3rd Qu.:43.00                      3rd Qu.: 38.00  
##                     Max.   :55.00                      Max.   :726.00
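The missing-value and zero checks above can be done for every column in one pass. The following is a minimal sketch on a small made-up data frame; the `toy` values are illustrative assumptions and are not taken from pop_dataset_0002.csv.

```r
# Toy stand-in for the population table (values are made up)
toy <- data.frame(
  region     = c("A", "A", "B", "B"),
  age        = c(0, 1, 0, 1),
  gender     = c("M", "F", "M", "F"),
  population = c(5, 0, NA, 3)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of rows with a zero population (NA rows are skipped)
zero_rows <- sum(toy$population == 0, na.rm = TRUE)

# Keep only rows with a known, non-zero population
toy_clean <- toy[!is.na(toy$population) & toy$population != 0, ]

print(na_counts)
print(zero_rows)
print(toy_clean)
```

The explicit `!is.na()` test matters only when NAs are present: a bare `toy$population != 0` filter would return an all-NA row for each missing value.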

What I can see in the data

This is a brief description of what the data look like. I include it to assure myself that I am working with the right dataset, and to show the marker (the reader) that I understand what the data frame contains.
Each region is identified by an “SSC” prefix followed by a number, for example “SSC21184”. Within a region, each row holds one age (0 to 55), one gender (M or F) and the population count for that combination: “males at age 0”, “females at age 0”, and so on. To get a region’s total population, we sum the population column across both genders and all ages. To get a region’s mean age, we multiply each age by its population, sum those products, and divide by the region’s total population. Rows with age 0 add nothing to that sum, but they do not cause an error.

We can then calculate a mean age for each region. We can also draw a chart for each region with age on the x axis and population on the y axis, and a chart of mean age by region (region on the x axis, mean age on the y axis). Many rows have a population of 0, meaning nobody of that age and gender was counted in that region. These zeros are valid counts rather than missing values and do not cause errors in themselves; a mean age would only be undefined (0/0) for a region whose total population is 0.

I was unsure how to group data, so while writing and reviewing this document I have taught myself the basics of group_by(), filter() and other dplyr manipulation functions.

# count the rows (age-gender combinations with a non-zero population) in each region
data$region %>% table()
## .
## SSC20005 SSC20012 SSC20018 SSC20027 SSC20029 SSC20048 SSC20062 SSC20076 
##       10       81       29      111       16      111       75      112 
## SSC20079 SSC20099 SSC20101 SSC20106 SSC20107 SSC20127 SSC20135 SSC20140 
##      112        1       11        5      112        1      112       66 
## SSC20151 SSC20161 SSC20163 SSC20167 SSC20170 SSC20173 SSC20177 SSC20179 
##        1      111       45        4       74      112        7        5 
## SSC20190 SSC20191 SSC20196 SSC20200 SSC20201 SSC20204 SSC20205 SSC20211 
##      112      112        6      112        3       12        3       18 
## SSC20241 SSC20249 SSC20257 SSC20266 SSC20269 SSC20275 SSC20276 SSC20282 
##       15       27        2        7       47       16        2       37 
## SSC20293 SSC20305 SSC20306 SSC20312 SSC20313 SSC20325 SSC20330 SSC20337 
##       18      112       18      112      112        9        7       19 
## SSC20343 SSC20345 SSC20346 SSC20352 SSC20353 SSC20355 SSC20356 SSC20360 
##      112       11        1       32       18       10        5      112 
## SSC20361 SSC20366 SSC20367 SSC20373 SSC20378 SSC20383 SSC20391 SSC20392 
##      112       12       30       35        2        1       46       31 
## SSC20393 SSC20398 SSC20401 SSC20407 SSC20415 SSC20417 SSC20422 SSC20423 
##       48        1        7      112        1      112        1       20 
## SSC20433 SSC20446 SSC20452 SSC20453 SSC20473 SSC20477 SSC20491 SSC20492 
##       17        1       42      112      111       35       54      112 
## SSC20500 SSC20502 SSC20503 SSC20506 SSC20514 SSC20515 SSC20516 SSC20518 
##       13        1        1        8       52        7        1      112 
## SSC20519 SSC20522 SSC20523 SSC20525 SSC20534 SSC20536 SSC20551 SSC20556 
##      109        2       55      112      112      112        2       57 
## SSC20558 SSC20564 SSC20567 SSC20571 SSC20577 SSC20578 SSC20579 SSC20583 
##      112       55      112      112      110      112      112       37 
## SSC20605 SSC20615 SSC20617 SSC20620 SSC20628 SSC20656 SSC20660 SSC20668 
##       17       15       88      112        3        4      112        4 
## SSC20691 SSC20701 SSC20706 SSC20718 SSC20725 SSC20739 SSC20742 SSC20747 
##       18       86        6       41      112       56      111       11 
## SSC20770 SSC20773 SSC20774 SSC20787 SSC20796 SSC20798 SSC20799 SSC20810 
##       34      112        4       47      112       53        5       64 
## SSC20814 SSC20817 SSC20822 SSC20827 SSC20830 SSC20836 SSC20837 SSC20852 
##       15      112      112       96      112        6      103       22 
## SSC20858 SSC20864 SSC20865 SSC20870 SSC20879 SSC20883 SSC20890 SSC20894 
##      112       56      112        2      112      112       26        3 
## SSC20911 SSC20912 SSC20934 SSC20943 SSC20948 SSC20953 SSC20957 SSC20975 
##      112      112      107      112       36       26       20        4 
## SSC20977 SSC20986 SSC20989 SSC20995 SSC20999 SSC21004 SSC21011 SSC21012 
##        5        3        4       96       21        9       21        1 
## SSC21013 SSC21014 SSC21020 SSC21026 SSC21032 SSC21034 SSC21037 SSC21040 
##       50        1       16        1       38       12        1      112 
## SSC21049 SSC21055 SSC21061 SSC21062 SSC21077 SSC21088 SSC21092 SSC21101 
##      112        2       42       95       87        1        2       15 
## SSC21102 SSC21113 SSC21120 SSC21121 SSC21125 SSC21126 SSC21136 SSC21137 
##       18       65       20      112      112       18      112        1 
## SSC21141 SSC21143 SSC21144 SSC21152 SSC21169 SSC21170 SSC21175 SSC21178 
##       58      112      112      106       13      112        7      112 
## SSC21184 SSC21189 SSC21190 SSC21191 SSC21193 SSC21200 SSC21207 SSC21210 
##      112        2        2        3      102       10        2      112 
## SSC21211 SSC21214 SSC21218 SSC21233 SSC21238 SSC21244 SSC21251 SSC21256 
##        2      112        8       51      112       24        5       60 
## SSC21260 SSC21264 SSC21267 SSC21281 SSC21283 SSC21286 SSC21294 SSC21304 
##       72       36       10       13        1      103        8      112 
## SSC21305 SSC21308 SSC21316 SSC21320 SSC21324 SSC21329 SSC21332 SSC21342 
##        4      112        6      112       13      112      112        6 
## SSC21347 SSC21348 SSC21350 SSC21351 SSC21367 SSC21369 SSC21374 SSC21380 
##      112        6      112       57        4       16        3       64 
## SSC21387 SSC21405 SSC21408 SSC21412 SSC21416 SSC21417 SSC21421 SSC21426 
##        7        6       43       36        4       29       23      108 
## SSC21440 SSC21441 SSC21457 SSC21459 SSC21461 SSC21462 SSC21475 SSC21479 
##      112       74        7      106       15       22       21      103 
## SSC21494 SSC21505 SSC21511 SSC21514 SSC21517 SSC21534 SSC21536 SSC21544 
##        8       67      109       81       94      112       37      112 
## SSC21547 SSC21561 SSC21562 SSC21567 SSC21575 SSC21585 SSC21590 SSC21594 
##       21      112        7        2       98        9       19        7 
## SSC21597 SSC21601 SSC21602 SSC21620 SSC21630 SSC21638 SSC21639 SSC21654 
##       66      110       12       15       20      104        4       92 
## SSC21660 SSC21666 SSC21671 SSC21674 SSC21678 SSC21681 SSC21683 SSC21690 
##       25      112      112      112        3        7        5        4 
## SSC21691 SSC21701 SSC21702 SSC21719 SSC21730 SSC21732 SSC21734 SSC21736 
##       33        2       23        7      112        8      112       10 
## SSC21741 SSC21743 SSC21755 SSC21760 SSC21769 SSC21771 SSC21778 SSC21779 
##        5      112      108       91       10      112      112        7 
## SSC21784 SSC21798 SSC21801 SSC21802 SSC21803 SSC21805 SSC21808 SSC21812 
##       11       39       40       19       17        2       64       63 
## SSC21817 SSC21820 SSC21823 SSC21830 SSC21839 SSC21848 SSC21849 SSC21858 
##       12       92       21       18       11       17        5       56 
## SSC21862 SSC21863 SSC21888 SSC21889 SSC21890 SSC21892 SSC21899 SSC21900 
##       17       73       12        2        6        7      107       50 
## SSC21902 SSC21905 SSC21907 SSC21914 SSC21915 SSC21916 SSC21918 SSC21919 
##       13      108       14        4       67       55       67       11 
## SSC21928 SSC21932 SSC21939 SSC21945 SSC21946 SSC21947 SSC21950 SSC21951 
##      111       38        4        6        1        5       22      112 
## SSC21956 SSC21960 SSC21963 SSC21979 SSC21988 SSC22003 SSC22007 SSC22011 
##       42       22       21       19      112       10       34        6 
## SSC22012 SSC22015 SSC22019 SSC22021 SSC22028 SSC22029 SSC22030 SSC22039 
##        1      112       61        1        1      112      112       18 
## SSC22043 SSC22046 SSC22052 SSC22053 SSC22059 SSC22065 SSC22070 SSC22072 
##        4       26       58       31        3       52        1      112 
## SSC22075 SSC22076 SSC22082 SSC22084 SSC22085 SSC22086 SSC22089 SSC22096 
##       19      111       39        1        5       82       10       19 
## SSC22101 SSC22106 SSC22110 SSC22124 SSC22125 SSC22133 SSC22134 SSC22139 
##       28      112      112       26       77       35       42      112 
## SSC22157 SSC22168 SSC22170 SSC22175 SSC22180 SSC22185 SSC22190 SSC22193 
##        1       43        4        2        3       19       12        1 
## SSC22200 SSC22209 SSC22224 SSC22227 SSC22232 SSC22236 SSC22237 SSC22239 
##        5      112       25       83       10      112        1      112 
## SSC22254 SSC22263 SSC22265 SSC22273 SSC22274 SSC22281 SSC22283 SSC22284 
##      112        4      108        4        7        1       28       62 
## SSC22288 SSC22296 SSC22309 SSC22323 SSC22327 SSC22330 SSC22331 SSC22333 
##      109       69      112      112        9        7       13      112 
## SSC22338 SSC22340 SSC22342 SSC22344 SSC22349 SSC22356 SSC22366 SSC22367 
##       46       49        2       31        3       11       10        1 
## SSC22371 SSC22373 SSC22386 SSC22396 SSC22398 SSC22424 SSC22442 SSC22453 
##       24        4       86       17       24       11        4       39 
## SSC22467 SSC22468 SSC22470 SSC22476 SSC22482 SSC22489 SSC22490 SSC22495 
##      111        2        8      111        5       99       18        9 
## SSC22500 SSC22505 SSC22513 SSC22517 SSC22521 SSC22523 SSC22525 SSC22532 
##        5        8       46        2       95       49       20       10 
## SSC22534 SSC22556 SSC22561 SSC22569 SSC22586 SSC22597 SSC22598 SSC22605 
##      112      112       11      112       62        1       74        1 
## SSC22606 SSC22609 SSC22616 SSC22621 SSC22650 SSC22652 SSC22653 SSC22659 
##        1        2        1       26      102       24       22       67 
## SSC22660 SSC22696 SSC22698 SSC22701 SSC22702 SSC22703 SSC22706 SSC22708 
##      112       22        1        8        1        3        8      112 
## SSC22710 SSC22717 SSC22719 SSC22729 SSC22734 SSC22737 SSC22744 SSC22747 
##        4      112        1       44        6      110      112      112 
## SSC22752 SSC22755 SSC22761 SSC22764 SSC22772 SSC22773 SSC22779 SSC22800 
##      112      112      112        7        1        4      111       13 
## SSC22803 SSC22809 SSC22838 SSC22840 SSC22844 SSC22846 SSC22849 SSC22861 
##       18        1       50        1       10       20        4      111 
## SSC22864 SSC22867 SSC22869 SSC22871 SSC22873 SSC22877 SSC22886 SSC22910 
##        2        2       36        8        1      112        5       81 
## SSC22912 SSC22915 SSC22918 SSC22922 
##       51        7        7        5

I now know the names of the regions and how many non-zero rows each one has. These are row counts (at most 112 per region: 56 ages × 2 genders), not populations.

# group by region, count the rows in each group, and arrange in descending order

total_age_region <- data %>% 
  group_by(region) %>% 
  summarise(n_rows = n()) %>% 
  arrange(desc(n_rows))
total_age_region
#wow. I'm blown away that worked. Now to do the same thing but with the means generated

#mean_age_region <- data %>% 
#    group_by(region) %>% 
#  mean(population)
#  summarise(population = n()) %>% 
#  arrange(desc(populationd))

# I don't know how to do this yet
# essentially, I want to multiply the population by the age for each row, and sum the output for each region
# so there is probably a group_by() inside a loop somewhere, but I don't know how to write that. I'll try to use two parallel group_by() calls at the beginning of Task 2.
# I am journaling this challenge because I'm very unsure how this will work out, although there is a really clear example in the coursework (for the other statistics).
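The calculation described in the comments above (multiply age by population in each row, then sum within each region) needs no loop: a single group_by() followed by summarise() does it. The following is a minimal sketch on made-up data; the `toy` table and the output column names `total_pop` and `mean_age` are illustrative assumptions.

```r
library(dplyr)

# Toy stand-in for the population table (values are made up)
toy <- data.frame(
  region     = c("A", "A", "A", "B", "B"),
  age        = c(0, 10, 20, 30, 50),
  population = c(10, 20, 10, 5, 15)
)

# Population-weighted mean age per region:
# sum(age * population) / sum(population), computed within each region
weighted_means <- toy %>%
  group_by(region) %>%
  summarise(
    total_pop = sum(population),
    mean_age  = sum(age * population) / sum(population)
  ) %>%
  arrange(desc(total_pop))

print(weighted_means)
```

Base R's weighted.mean(age, population) gives the same value as the sum(...) / sum(...) expression and could be used inside summarise() instead.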

Task 2

Consider only the mean age of each region.
2.1 Produce the following summary statistics for the region means:
• mean
• standard deviation
• minimum
• first quartile
• median
• third quartile
• maximum
• interquartile range
• histogram of the distribution of region means
2.2 Produce a histogram of the region means with proportions or percentages on the y axis.
2.3 Discuss whether the region means exhibit the characteristic shape of a normal distribution. Include at least two justifications in support of your conclusion.

# Group the data by region
data_by_region = data %>% group_by(region)
data_by_region
# Calculate the mean age for each region (unweighted: each age-gender row counts once, whatever its population)
region_means = data_by_region %>% summarize(mean_age = mean(age))
region_means
# Print summary statistics for the region means
print(paste("Mean of region means: ", mean(region_means$mean_age)))
## [1] "Mean of region means:  30.2632842465516"
print(paste("Standard deviation of region means: ", sd(region_means$mean_age)))
## [1] "Standard deviation of region means:  7.95217355449094"
print(paste("Minimum of region means: ", min(region_means$mean_age)))
## [1] "Minimum of region means:  2"
print(paste("First quartile of region means: ", quantile(region_means$mean_age, 0.25)))
## [1] "First quartile of region means:  27.5"
print(paste("Median of region means: ", median(region_means$mean_age)))
## [1] "Median of region means:  28.1848290598291"
print(paste("Third quartile of region means: ", quantile(region_means$mean_age, 0.75)))
## [1] "Third quartile of region means:  32.675"
print(paste("Maximum of region means: ", max(region_means$mean_age)))
## [1] "Maximum of region means:  55"
print(paste("Interquartile range of region means: ", IQR(region_means$mean_age)))
## [1] "Interquartile range of region means:  5.175"
# Create histogram of the region means
hist(region_means$mean_age, main = "Histogram of Region Means", xlab = "Mean Age", ylab = "Frequency")
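Task 2.2 asks for proportions on the y axis, whereas hist() shows counts by default. The following is a minimal sketch of one way to rescale a base R histogram; the `toy_means` values are made up and stand in for region_means$mean_age.

```r
# Hypothetical region means (made-up values)
toy_means <- c(22, 25, 27, 27.5, 28, 28, 29, 31, 33, 40)

# Compute the bins without drawing anything
h <- hist(toy_means, plot = FALSE)

# Replace the bin counts with proportions of the total
h$counts <- h$counts / sum(h$counts)

# Draw the histogram; with freq = TRUE the y axis uses the rescaled counts
plot(h, freq = TRUE,
     main = "Histogram of Region Means",
     xlab = "Mean Age",
     ylab = "Proportion of regions")
```

This differs from hist(..., probability = TRUE), which plots densities: those integrate to 1 over the x axis rather than summing to 1 across the bars.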

Analysis of the region means

The region means do seem to show the characteristic bell shape. The mean of the region means is about 30.3 and the median about 28.2, so the centre of the distribution sits at roughly 28 to 30 years. There is one caveat: the values above 50 are skewed upward, because the source data pool at 55 years of age. If 55 really stands for “55 and over”, those values are not comparable with the other ages and should probably be ignored.
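Two quick numerical checks can back up a visual judgement of normality. The following is a minimal sketch on simulated data, not the report's data: for a normal distribution the IQR is about 1.35 times the standard deviation, and the mean and median coincide. The `toy_means` vector is an illustrative assumption.

```r
# Simulated, roughly normal values standing in for the region means
set.seed(1)
toy_means <- rnorm(500, mean = 30, sd = 8)

# Check 1: IQR / sd is close to 1.35 when the data are normal
iqr_sd_ratio <- IQR(toy_means) / sd(toy_means)

# Check 2: mean minus median is close to 0 when the data are symmetric
mean_median_gap <- mean(toy_means) - median(toy_means)

print(iqr_sd_ratio)
print(mean_median_gap)

# Check 3: points on a normal Q-Q plot fall near the reference line
qqnorm(toy_means)
qqline(toy_means)
```

For comparison, the values printed in Task 2 give 5.175 / 7.952, or about 0.65, for the first check, and a mean about 2.1 years above the median for the second.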

Task 3

Consider the region with the largest population size:
3.1 Identify the region and describe its population size in comparison with the other regions.
3.2 Produce summary statistics for age in this region.
• mean
• standard deviation
• minimum
• first quartile
• median
• third quartile
• maximum
• interquartile range
• histogram of the distribution of region means.

# Create a plot of the population sum by region
# Group the data by region
data_by_region = data %>% group_by(region)

# Calculate the sum of population for each region
region_pop = data_by_region %>% summarize(pop_sum = sum(population))

# Print the sum of population for each region
print(region_pop)
## # A tibble: 500 × 2
##    region   pop_sum
##    <chr>      <dbl>
##  1 SSC20005      33
##  2 SSC20012     425
##  3 SSC20018     100
##  4 SSC20027    1137
##  5 SSC20029      51
##  6 SSC20048     924
##  7 SSC20062     359
##  8 SSC20076    5821
##  9 SSC20079    4978
## 10 SSC20099       3
## # … with 490 more rows
# plot a bar chart of total population by region to visualise the output
ggplot(data=region_pop, aes(x=region, y=pop_sum)) + geom_bar(stat = "identity")

# Arrange the data in descending order by the pop_sum column
region_pop_desc = region_pop %>% arrange(desc(pop_sum))

# Print the region population in descending order
print(region_pop_desc)
## # A tibble: 500 × 2
##    region   pop_sum
##    <chr>      <dbl>
##  1 SSC22015   37948
##  2 SSC21671   22979
##  3 SSC21125   20939
##  4 SSC20911   19340
##  5 SSC22569   19274
##  6 SSC21143   19180
##  7 SSC20773   18540
##  8 SSC22556   17928
##  9 SSC20865   17809
## 10 SSC20660   17442
## # … with 490 more rows

Answers for the largest population and summary statistics for that region

The region with the largest population is SSC22015, with a population of 37948. That is well ahead of the next largest region, SSC21671, which has 22979.

# filter data where region = SSC22015
data_SSC22015 = data[data$region == "SSC22015", ]
data_SSC22015
# Generate summary statistics for data_SSC22015 (unweighted: computed over the 112 age-gender rows, not over people)
summary_stats2 = data_SSC22015 %>% summarise(mean_age = mean(age),
                                                 sd_age = sd(age),
                                                 min_age = min(age),
                                                 Q1_age = quantile(age, probs = 0.25),
                                                 median_age = median(age),
                                                 Q3_age = quantile(age, probs = 0.75),
                                                 max_age = max(age),
                                                 IQR_age = IQR(age))

# Print the summary statistics
print(summary_stats2)
##   mean_age   sd_age min_age Q1_age median_age Q3_age max_age IQR_age
## 1     27.5 16.23587       0  13.75       27.5  41.25      55    27.5
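The statistics above treat each of the 112 age-gender rows as one observation, which is why the mean is exactly 27.5, the midpoint of 0 to 55. To describe the people in the region instead, one option is to repeat each age as many times as its population and summarise the expanded vector. The following is a minimal sketch on made-up counts; `toy_region` and `weighted_stats` are illustrative names.

```r
# Toy region: one row per age, with a head count for each (values are made up)
toy_region <- data.frame(
  age        = c(0, 1, 2, 3),
  population = c(4, 3, 2, 1)
)

# One entry per person: each age repeated 'population' times
ages_expanded <- rep(toy_region$age, times = toy_region$population)

# Summary statistics over people rather than over rows
weighted_stats <- c(
  mean   = mean(ages_expanded),
  sd     = sd(ages_expanded),
  min    = min(ages_expanded),
  Q1     = unname(quantile(ages_expanded, 0.25)),
  median = median(ages_expanded),
  Q3     = unname(quantile(ages_expanded, 0.75)),
  max    = max(ages_expanded),
  IQR    = IQR(ages_expanded)
)

print(weighted_stats)

# For comparison: the row-level mean ignores the head counts
print(mean(toy_region$age))
```

The same expanded vector can be passed to hist() to draw an age distribution in which every person counts once.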

3.3 How does the age distribution for this region compare with the distribution of means provided in Task 2? You may use visualisations to supplement your discussion.

#filter data where region = ssc22015
data_SSC22015 = data[data$region == "SSC22015", ]
summary (data_SSC22015)
##     region               age           gender            population   
##  Length:112         Min.   : 0.00   Length:112         Min.   :175.0  
##  Class :character   1st Qu.:13.75   Class :character   1st Qu.:280.2  
##  Mode  :character   Median :27.50   Mode  :character   Median :325.0  
##                     Mean   :27.50                      Mean   :338.8  
##                     3rd Qu.:41.25                      3rd Qu.:404.5  
##                     Max.   :55.00                      Max.   :527.0

# Comparing Age Distributions for Two Data Sets
### Visualizing the Data

# Boxplot comparison
boxplot(data$age, data_SSC22015$age, names = c("Full Pop data","SSC22015"))

# Histogram comparison
hist(data$age, probability = TRUE, col = "blue", main = "Age Distribution in Full Pop data", xlab = "Age Group")

hist(data_SSC22015$age, probability = TRUE, col = "red", main = "Age Distribution in SSC22015", xlab = "Age Group")

### Summary Statistics

# Descriptive statistics
summary(data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   13.00   28.00   28.26   43.00   55.00
summary(data_SSC22015$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   13.75   27.50   27.50   41.25   55.00
# Differences between the two data sets
mean_diff <- mean(data$age) - mean(data_SSC22015$age)
median_diff <- median(data$age) - median(data_SSC22015$age)

The box plot and summary statistics suggest that the age distributions of the full data set and of region SSC22015 are very similar.

3.4 Plot the distribution of age for males in the region.
3.5 Plot the distribution of age for females in the region.
3.6 Compare the distributions of Task 3.4 and Task 3.5, and discuss your findings.

# Subset the data for region SSC22015 (both genders) before plotting each gender
region_data <- data[data$region == "SSC22015",]
summary(region_data)
##     region               age           gender            population   
##  Length:112         Min.   : 0.00   Length:112         Min.   :175.0  
##  Class :character   1st Qu.:13.75   Class :character   1st Qu.:280.2  
##  Mode  :character   Median :27.50   Mode  :character   Median :325.0  
##                     Mean   :27.50                      Mean   :338.8  
##                     3rd Qu.:41.25                      3rd Qu.:404.5  
##                     Max.   :55.00                      Max.   :527.0
print(region_data)
##        region age gender population
## 1457 SSC22015   0      M        455
## 1458 SSC22015   0      F        423
## 1459 SSC22015   1      M        492
## 1460 SSC22015   1      F        479
## 1461 SSC22015   2      M        465
## 1462 SSC22015   2      F        453
## 1463 SSC22015   3      M        478
## 1464 SSC22015   3      F        497
## 1465 SSC22015   4      M        527
## 1466 SSC22015   4      F        438
## 1467 SSC22015   5      M        434
## 1468 SSC22015   5      F        413
## 1469 SSC22015   6      M        415
## 1470 SSC22015   6      F        404
## 1471 SSC22015   7      M        396
## 1472 SSC22015   7      F        396
## 1473 SSC22015   8      M        409
## 1474 SSC22015   8      F        358
## 1475 SSC22015   9      M        354
## 1476 SSC22015   9      F        341
## 1477 SSC22015  10      M        371
## 1478 SSC22015  10      F        352
## 1479 SSC22015  11      M        329
## 1480 SSC22015  11      F        329
## 1481 SSC22015  12      M        303
## 1482 SSC22015  12      F        274
## 1483 SSC22015  13      M        305
## 1484 SSC22015  13      F        292
## 1485 SSC22015  14      M        295
## 1486 SSC22015  14      F        255
## 1487 SSC22015  15      M        249
## 1488 SSC22015  15      F        290
## 1489 SSC22015  16      M        284
## 1490 SSC22015  16      F        261
## 1491 SSC22015  17      M        286
## 1492 SSC22015  17      F        265
## 1493 SSC22015  18      M        258
## 1494 SSC22015  18      F        286
## 1495 SSC22015  19      M        282
## 1496 SSC22015  19      F        259
## 1497 SSC22015  20      M        254
## 1498 SSC22015  20      F        243
## 1499 SSC22015  21      M        263
## 1500 SSC22015  21      F        325
## 1501 SSC22015  22      M        306
## 1502 SSC22015  22      F        340
## 1503 SSC22015  23      M        311
## 1504 SSC22015  23      F        338
## 1505 SSC22015  24      M        316
## 1506 SSC22015  24      F        363
## 1507 SSC22015  25      M        302
## 1508 SSC22015  25      F        411
## 1509 SSC22015  26      M        372
## 1510 SSC22015  26      F        424
## 1511 SSC22015  27      M        320
## 1512 SSC22015  27      F        390
## 1513 SSC22015  28      M        356
## 1514 SSC22015  28      F        447
## 1515 SSC22015  29      M        380
## 1516 SSC22015  29      F        411
## 1517 SSC22015  30      M        455
## 1518 SSC22015  30      F        482
## 1519 SSC22015  31      M        428
## 1520 SSC22015  31      F        478
## 1521 SSC22015  32      M        414
## 1522 SSC22015  32      F        482
## 1523 SSC22015  33      M        410
## 1524 SSC22015  33      F        406
## 1525 SSC22015  34      M        396
## 1526 SSC22015  34      F        459
## 1527 SSC22015  35      M        413
## 1528 SSC22015  35      F        391
## 1529 SSC22015  36      M        339
## 1530 SSC22015  36      F        382
## 1531 SSC22015  37      M        348
## 1532 SSC22015  37      F        386
## 1533 SSC22015  38      M        364
## 1534 SSC22015  38      F        313
## 1535 SSC22015  39      M        325
## 1536 SSC22015  39      F        335
## 1537 SSC22015  40      M        294
## 1538 SSC22015  40      F        294
## 1539 SSC22015  41      M        305
## 1540 SSC22015  41      F        312
## 1541 SSC22015  42      M        297
## 1542 SSC22015  42      F        291
## 1543 SSC22015  43      M        294
## 1544 SSC22015  43      F        312
## 1545 SSC22015  44      M        347
## 1546 SSC22015  44      F        354
## 1547 SSC22015  45      M        293
## 1548 SSC22015  45      F        319
## 1549 SSC22015  46      M        275
## 1550 SSC22015  46      F        311
## 1551 SSC22015  47      M        264
## 1552 SSC22015  47      F        294
## 1553 SSC22015  48      M        271
## 1554 SSC22015  48      F        265
## 1555 SSC22015  49      M        248
## 1556 SSC22015  49      F        271
## 1557 SSC22015  50      M        218
## 1558 SSC22015  50      F        227
## 1559 SSC22015  51      M        211
## 1560 SSC22015  51      F        261
## 1561 SSC22015  52      M        262
## 1562 SSC22015  52      F        218
## 1563 SSC22015  53      M        238
## 1564 SSC22015  53      F        274
## 1565 SSC22015  54      M        194
## 1566 SSC22015  54      F        217
## 1567 SSC22015  55      M        175
## 1568 SSC22015  55      F        212
# plot age distribution for males
# each row is already a count for one age, so geom_col() draws the bars directly
ggplot(data = region_data[region_data$gender == "M", ], aes(x = age, y = population)) +
  geom_col(fill = "blue", alpha = 0.5) +
  scale_y_continuous(limits = c(0, 600)) +
  ggtitle("Age Distribution for Males in SSC22015")

# plot age distribution for females
ggplot(data = region_data[region_data$gender == "F", ], aes(x = age, y = population)) +
  geom_col(fill = "pink", alpha = 0.5) +
  scale_y_continuous(limits = c(0, 600)) +
  ggtitle("Age Distribution for Females in SSC22015")

My first attempt used geom_histogram() with age on the x axis and population on the y axis, and no plot appeared. geom_histogram() bins the x values and counts rows itself, so it cannot also take a y aesthetic. Because each row of region_data already holds the population for one age and gender, geom_col() is the right geometry: it plots age on the x axis and population on the y axis directly. Males are drawn in blue and females in pink, the y axis runs from 0 to 600 to cover every value in the region, and the titles are “Age Distribution for Males in SSC22015” and “Age Distribution for Females in SSC22015”.

Task 4

Now consider all regions:
4.1 For each region, calculate the ratio of older to younger people, where ‘younger’ is defined as aged below 40 years and ‘older’ as age 40 years and above.
4.2 Plot the ratio of each region against its population size.
4.3 Comment on any trends you see in the data. What could explain such trends?

# Recode age as "Older" (40 and above) or "Younger" (below 40). Note: this overwrites the numeric age column in data.
data$age <- ifelse(data$age >= 40, "Older", "Younger")

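# Sum the population by age group. Note: the second group_by() replaces the first, so the result has one row per age group across all regions, not one row per region.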
region_age_group <- data %>% group_by(region) %>% group_by(age) %>%
  summarise(population = sum(population))
str(region_age_group)
## tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
##  $ age       : chr [1:2] "Older" "Younger"
##  $ population: num [1:2] 221294 574721
# Attempt at the ratio of older to younger people, called 'ratio'. Note: this divides the two age-group totals by a one-row data frame, so it does not give a ratio for each region.
region_age_group$ratio <- region_age_group$population / region_age_group[region_age_group$age == "Older", "population"]
str(region_age_group$ratio)
## 'data.frame':    2 obs. of  1 variable:
##  $ population: num  1 1
# Plot the ratio of older to younger people against population size
#ggplot(data = region_age_group, aes(x = population, y = ratio)) +   geom_point() +   ggtitle("Ratio of Older to Younger People by Region") +   xlab("Population Size") +   ylab("Ratio of Older to Younger People")
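Task 4.1 asks for one ratio per region. The following is a minimal sketch of one way to get there, on made-up data: keep age numeric, add the age group as a separate column, and summarise inside a single group_by(region). The `toy` table and the column names `age_group`, `older`, `younger`, `total_pop` and `ratio` are illustrative assumptions, not part of the brief.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the population table (values are made up)
toy <- data.frame(
  region     = rep(c("A", "B"), each = 4),
  age        = rep(c(10, 30, 40, 55), times = 2),
  population = c(10, 30, 15, 5, 100, 100, 25, 25)
)

region_ratio <- toy %>%
  # Add the age group as a new column so the numeric age is kept
  mutate(age_group = ifelse(age >= 40, "Older", "Younger")) %>%
  group_by(region) %>%
  summarise(
    older     = sum(population[age_group == "Older"]),
    younger   = sum(population[age_group == "Younger"]),
    total_pop = sum(population)
  ) %>%
  # One older-to-younger ratio per region
  mutate(ratio = older / younger)

print(region_ratio)

# One point per region: ratio against population size
p <- ggplot(region_ratio, aes(x = total_pop, y = ratio)) +
  geom_point() +
  ggtitle("Ratio of Older to Younger People by Region") +
  xlab("Population Size") +
  ylab("Ratio of Older to Younger People")
print(p)
```

A region with no younger people would give a ratio of Inf (or NaN if it has no older people either), so such regions may need to be filtered out before plotting.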

Task 5

# calculate the ratio of male rows to female rows for each region (this counts the age groups present, not the number of people)
ratio <- data %>%
  group_by(region) %>%
  summarize(male_ratio = sum(gender == "M") / sum(gender == "F")) 
ratio
summary(ratio)
str(ratio)
## tibble [500 × 2] (S3: tbl_df/tbl/data.frame)
##  $ region    : chr [1:500] "SSC20005" "SSC20012" "SSC20018" "SSC20027" ...
##  $ male_ratio: num [1:500] 0.667 1.025 1.636 0.982 1.286 ...
#ggplot(aes(x = data$population, y = ratio$male_ratio)) + geom_point() +  xlab("Population Size") +  ylab("Ratio of Males to Females") +  ggtitle("Ratio of Males to Females by Region")
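# this plot throws an error: ggplot() needs a data frame as its first argument, and data$population (one value per row) and ratio$male_ratio (one value per region) have different lengths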

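Comments: There are several Inf results for some regions. In R, a positive number divided by 0 gives Inf, so these are regions that have at least one male row but no female rows left after the zero-population rows were removed (0/0 would give NaN instead). Some ratios are exactly 1, meaning the region has the same number of male and female rows.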
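A ratio of people, rather than of rows, can be obtained by summing the population for each gender within a region. The following is a minimal sketch on made-up data; the `toy` table and the column names `males`, `females`, `total_pop` and `male_ratio` are illustrative assumptions. Regions with no females are given NA instead of Inf, which is one possible choice.

```r
library(dplyr)

# Toy stand-in for the population table (values are made up)
toy <- data.frame(
  region     = c("A", "A", "A", "A", "B", "B", "C"),
  gender     = c("M", "F", "M", "F", "M", "F", "M"),
  population = c(10, 20, 30, 20, 6, 3, 7)
)

sex_ratio <- toy %>%
  group_by(region) %>%
  summarise(
    males     = sum(population[gender == "M"]),
    females   = sum(population[gender == "F"]),
    total_pop = sum(population)
  ) %>%
  # Avoid dividing by zero: regions without females get NA
  mutate(male_ratio = ifelse(females > 0, males / females, NA_real_))

print(sex_ratio)
```

Because `sex_ratio` has one row per region, `total_pop` and `male_ratio` have the same length and can be mapped to the x and y axes of a single scatter plot.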

Task 6

To select a gender and age group, we need data. Ideally this would include purchasing trends and general market data for energy drinks. I assume that young males with enough disposable income (that is, in employment) will be a key demographic for this product. So males aged 17 to 31 will be my target group (Friis et al., 2014).

We need to find the regions with the largest male population in that age range. This is a matter of filtering the original table for males with 17 <= age <= 31, summing the population by region, and sorting the result in descending order.

data2 <- read.csv(here("data", "pop_dataset_0002.csv"))

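# filter data for males aged 17-31
# note: the gender test below uses data$gender (from the zero-filtered table) while the rows come from data2; data2$gender is the column that matches data2 row for row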
target_market <- data2[data$gender == "M" & data2$age >= 17 & data2$age <= 31,]
str(target_market)
## 'data.frame':    7526 obs. of  4 variables:
##  $ region    : chr  "SSC21184" "SSC21184" "SSC21184" "SSC21184" ...
##  $ age       : int  17 18 19 20 21 22 23 24 25 26 ...
##  $ gender    : chr  "M" "M" "M" "M" ...
##  $ population: int  116 150 120 140 116 122 116 107 109 114 ...
# group by region and sum population
target_market_by_region <- aggregate(target_market$population, by = list(target_market$region), sum)

# re-name columns
colnames(target_market_by_region) <- c("region", "population")

# sort by population in descending order
target_market_by_region <- target_market_by_region[order(-target_market_by_region$population),]

# display the top 2 regions (the first two rows after sorting)
top_2_regions <- head(target_market_by_region, 2)
top_2_regions

I would target my resources at regions SSC20492 and SSC22015 (the latter, as it happens, is the region we have been investigating throughout this project). I would do this because the literature cited above suggests that males with disposable income are the greatest consumers of this kind of product.

# estimate how many of the two regions' target population will attend, first at 15% attendance and then at 30%

# Filter data for males between the ages of 17 and 31
# (data2 is used because the age column in data was recoded to "Older"/"Younger" in Task 4)
target_pop <- data2 %>% 
  filter(gender == "M", age >= 17, age <= 31)

# Find the total population of the target market in each region
target_pop_by_region <- target_pop %>% 
  group_by(region) %>% 
  summarize(total_pop = sum(population)) %>%
  arrange(desc(total_pop))

# Estimate number of target population who will attend if 15% attendance
attendance_15_SSC20492 <- 5344 * 0.15
attendance_15_ssc22015 <- 4889 * 0.15


# Estimate number of target population who will attend if 30% attendance
attendance_30_SSC20492 <- 5344 * 0.30
attendance_30_ssc22015 <- 4889 * 0.30

# Print the results
cat("\n\nThe estimated number of attendees at 15% attendance is:")
## 
## 
## The estimated number of attendees at 15% attendance is:
attendance_15_SSC20492
## [1] 801.6
attendance_15_ssc22015
## [1] 733.35
cat("\n\nThe estimated number of attendees at 30% attendance is:")
## 
## 
## The estimated number of attendees at 30% attendance is:
attendance_30_SSC20492
## [1] 1603.2
attendance_30_ssc22015
## [1] 1466.7
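The attendance estimates can also be computed for both regions and both rates in one step, which avoids retyping the totals for each rate. The following is a minimal sketch that reuses the two target-market totals hard-coded above (5344 and 4889); the names `top_2`, `attendance_rates` and `estimates` are illustrative assumptions.

```r
# Target-market totals for the two chosen regions (figures from the code above)
top_2 <- data.frame(
  region     = c("SSC20492", "SSC22015"),
  population = c(5344, 4889)
)

# Attendance scenarios to evaluate
attendance_rates <- c(rate_15 = 0.15, rate_30 = 0.30)

# Every population multiplied by every rate: rows are regions, columns are rates
estimates <- outer(top_2$population, attendance_rates)
dimnames(estimates) <- list(top_2$region, names(attendance_rates))

# Attendees are whole people, so round for reporting
print(round(estimates))
```

In the report itself, `top_2$population` could be taken from the first two rows of the sorted region table instead of being typed in.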

Bibliography

Friis, K., Lyng, J. I., Lasgaard, M. and Larsen, F. B. (2014) Energy drink consumption and the relation to socio-demographic factors and health behaviour among young adults in Denmark. A population-based study. European Journal of Public Health, 24(5), pp. 840–844. https://doi.org/10.1093/eurpub/cku003