Section 1:

Describe three major data and design challenges faced in accomplishing the task. (15 marks)

  1. The first major challenge was data preparation: deriving the % of population per town (Planning Area) from the raw counts.

  2. Plotting the pyramid was difficult because most of my attempts produced only the male side or only the female side of the graph. Suitable references online were hard to find. For instance, following the steps on this website carefully still gave me a “male”-only or “female”-only plot: https://klein.uk/teaching/viz/datavis-pyramids/

  3. I also ran into version incompatibilities when reusing code found online. For example, in a line such as “geom_bar(subset = .(Gender == "Male"), stat = "identity")”, the “subset” argument is not compatible with the newest ggplot2 version.

  4. It was difficult to set the size of the plot so that the facet plots were less compressed and clearly visible.

  5. It was difficult to make the age groups in the age-sex pyramid run from youngest to oldest, because as text “5-9” always sorts after “45-49” and before “50-54”. I found it very hard to find a way to reorder the data like this in R.

Section 2

Suggest ways to overcome the challenges

(Resolve (1)): I used a dplyr pipeline to calculate the % of population per town. I made several careless mistakes at first, which caused the population aged 90+ to show as more than 100%. This is the code I ended up with:

df <- aggregate(Pop~PA+AG+Sex+Time,pop_data,sum) %>%
   filter(Time == 2019) %>%
   group_by(PA) %>%
   mutate(PAsum=sum(Pop)) %>%
   group_by(PA) %>%
   mutate(pct_Pop=Pop*100/PAsum) %>%
   arrange(PA) %>%
   mutate(PercentageByPA= (Pop/PAsum)*100)

After that, I removed some columns for easier viewing.
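One way to catch mistakes like the 90+ group exceeding 100% is to check that the percentages within each Planning Area add up to 100. The sketch below is only an illustration: the `toy` data frame and its numbers are made up, and only the column names mirror the real data.

```r
library(dplyr)

# Made-up data for illustration only; column names mirror the real data.
toy <- data.frame(
  PA  = c("A", "A", "A", "B", "B"),
  AG  = c("00-4", "05-9", "90+", "00-4", "90+"),
  Pop = c(50, 30, 20, 10, 30)
)

# Percentage of each row within its own Planning Area
toy_pct <- toy %>%
  group_by(PA) %>%
  mutate(PercentageByPA = Pop * 100 / sum(Pop)) %>%
  ungroup()

# Sanity check: each Planning Area should total 100%
check <- toy_pct %>%
  group_by(PA) %>%
  summarise(total = sum(PercentageByPA))

stopifnot(all(abs(check$total - 100) < 1e-8))
```

If the grouping is missing or wrong, the totals no longer equal 100 and the `stopifnot()` call fails immediately.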

(Resolve (2) and (3)): I worked from this other link instead: https://stackoverflow.com/questions/26724458/r-how-to-add-facet-labels-for-pyramid-like-plot-in-ggplot2. The key point is that I cannot use “geom_bar(subset = .(Gender == "Male"), stat = "identity")”, because the “subset” argument is no longer available in current ggplot2, according to this source: https://stackoverflow.com/questions/34588232/subset-parameter-in-layers-is-no-longer-working-with-ggplot2-2-0-0. I therefore changed my code to pass a pre-filtered data frame to each layer instead (see the code in the step-by-step guide below).
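The idea of replacing the “subset” argument can be sketched on a tiny example. The `toy` data frame below is made up for illustration; each layer receives its own filtered copy of the data, and the male values are negated so that they are drawn on the opposite side of zero.

```r
library(ggplot2)

# Made-up data for illustration only.
toy <- data.frame(
  AG  = rep(c("00-4", "05-9", "10-14"), times = 2),
  Sex = rep(c("Females", "Males"), each = 3),
  Pct = c(2, 3, 4, 2.5, 3.5, 4.5)
)

p_toy <- ggplot(toy, aes(x = AG, fill = Sex)) +
  # Females are drawn to the right of zero
  geom_col(data = subset(toy, Sex == "Females"), aes(y = Pct)) +
  # Males are mirrored to the left by negating the value
  geom_col(data = subset(toy, Sex == "Males"), aes(y = -Pct)) +
  # Flip the axes so the bars run horizontally, like a pyramid
  coord_flip()

p_toy
```

Filtering with base R's subset() inside the `data` argument keeps both sides in one plot, which avoids the “one side only” problem described in challenge 2.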

(Resolve (4)): A quick shortcut is to increase the size of the whole plot. To do this, change the knitr chunk options to:

knitr::opts_chunk$set(fig.width = 15 , fig.height = 15,echo = TRUE)

Only “fig.width” and “fig.height” need to be changed.

(Resolve (5)): To make the age groups sort correctly from youngest to oldest, I edited the labels inside the CSV file so that single-digit ages are zero-padded. For instance, “5_to_9” becomes “05-9”.
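As an alternative to editing the CSV by hand, the ordering can probably also be fixed in R by turning the age group into a factor whose levels are sorted by the lower bound of each range. This is a minimal sketch: apart from “5_to_9”, the labels (for example “90_and_over”) are assumed for illustration and may not match the raw file exactly.

```r
# Assumed raw labels, deliberately out of order
age_groups <- c("45_to_49", "5_to_9", "0_to_4", "90_and_over", "10_to_14")

# Take the number before the first underscore as the lower bound of the range
lower_bound <- as.numeric(sub("_.*$", "", age_groups))

# Sort the distinct labels by that lower bound and use them as factor levels
ordered_levels <- unique(age_groups[order(lower_bound)])
ag_factor <- factor(age_groups, levels = ordered_levels)

levels(ag_factor)
```

ggplot2 draws a discrete axis in the order of the factor levels, so a factor built this way would place “5_to_9” before “45_to_49” without any change to the file.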

The proposed design (Sketch)

image: Proposed Sketch

Section 3: Step By Step Guide

Import Packages

For this exercise, three R packages are used:

  • tidyverse, an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
  • ggthemes, an R package for creating plots with different background themes.
  • kableExtra, a package that helps you build common complex tables and manipulate table styles.

packages = c('tidyverse','ggthemes', 'kableExtra')

# install any package that is missing, then load it
for(p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

Import Dataset

The Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 dataset will be used. It is available at https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data

The R data table is called “pop_data”.
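The import command itself is not shown in this guide, so the following is only a sketch of how “pop_data” might be created. To stay runnable on its own, it writes a tiny made-up CSV to a temporary file; for the real data, `csv_path` would point to the downloaded file instead. Using read.csv() with `stringsAsFactors = TRUE` is an assumption, chosen because the output further down shows PA, AG and Sex as factors.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV with made-up
# numbers, written to a temporary location so the example runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PA,AG,Sex,Pop,Time",
  "Ang Mo Kio,00-4,Females,100,2019",
  "Ang Mo Kio,00-4,Males,110,2019"
), csv_path)

# Read the file; text columns become factors
pop_data <- read.csv(csv_path, stringsAsFactors = TRUE)

# Inspect column names and types before going further
str(pop_data)
```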

Preparing the data

Next, use the mutate() function of the dplyr package to derive three new measures: PAsum, pct_Pop and PercentageByPA. PAsum is the total population of each Planning Area (PA). pct_Pop is an intermediate value that is not used in the final table. PercentageByPA is the percentage of each age group within the population of its PA, and it is the dependent variable in our graph. We have to make sure the data is filtered for the year 2019 only.

# create the dataset for pyramid chart
df <- aggregate(Pop~PA+AG+Sex+Time,pop_data,sum) %>%
   filter(Time == 2019) %>%
   group_by(PA) %>%
   mutate(PAsum=sum(Pop)) %>%
   group_by(PA) %>%
   mutate(pct_Pop=Pop*100/PAsum) %>%
   arrange(PA) %>%
   mutate(PercentageByPA= (Pop/PAsum)*100) %>%
   select(-c(4,5,6,7)) #%>% # remove 4 columns: Time, Pop, PAsum, pct_Pop
   #filter(PA =="Ang Mo Kio")

df
## # A tibble: 2,090 x 4
## # Groups:   PA [55]
##    PA         AG    Sex     PercentageByPA
##    <fct>      <fct> <fct>            <dbl>
##  1 Ang Mo Kio 00-4  Females           1.62
##  2 Ang Mo Kio 05-9  Females           1.89
##  3 Ang Mo Kio 10-14 Females           2.23
##  4 Ang Mo Kio 15-19 Females           2.37
##  5 Ang Mo Kio 20-24 Females           2.67
##  6 Ang Mo Kio 25-29 Females           3.29
##  7 Ang Mo Kio 30-34 Females           3.31
##  8 Ang Mo Kio 35-39 Females           3.53
##  9 Ang Mo Kio 40-44 Females           3.83
## 10 Ang Mo Kio 45-49 Females           3.99
## # ... with 2,080 more rows

It is easier to add filter(PA == "Ang Mo Kio") before you try to plot a single age-sex pyramid. Otherwise the plot will not work, because every PA is used at once and the % for some age groups adds up to more than 100%. For testing, apply this filter first, and remove it only when you reach the facet_wrap step.

Impute NA with 0

This keeps the program from throwing errors when we plot a graph from data that contains “NA” values.

df[is.na(df)] = 0
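Since PA, AG and Sex are factors, a narrower variant is to fill only the numeric column, which leaves the factor columns untouched (0 is not a valid level for them). A minimal sketch with made-up data, where `toy` is a hypothetical stand-in for df:

```r
# Made-up data for illustration only.
toy <- data.frame(
  PA = factor(c("A", "A", "B")),
  PercentageByPA = c(1.5, NA, 2.0)
)

# Replace missing percentages with 0, touching only the numeric column
toy$PercentageByPA[is.na(toy$PercentageByPA)] <- 0

toy
```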

Plot the Pyramid plot

Prepare the layers. Explanations of the code are in the comments below:

# limits of the percentage axis, in % (males are plotted as negative values)
xmi <- -5
xma <- 5

# add the basic layers and plot bar graphs for both "Females" and "Males"
p = ggplot(data = df, aes(x = AG, fill = Sex)) +
  geom_bar(stat = "identity", data = subset(df, Sex == "Females"), aes(y = PercentageByPA)) +
  geom_text(data = subset(df, Sex == "Females"), aes(y = PercentageByPA, label = AG),
    size = 4, hjust = -0.1) +
  geom_bar(stat = "identity", data = subset(df, Sex == "Males"), aes(y = PercentageByPA * (-1))) +
  # the level is "Males"; filtering on "Male" returns no rows and drops the labels
  geom_text(data = subset(df, Sex == "Males"), aes(y = PercentageByPA * (-1), label = AG),
    size = 4, hjust = 1.1) +

  # facet wrap to make multiple plots align in one graph
  facet_wrap(~ PA, nrow = 5) +

  # ticks every 2.5% (a step of 10 over a range of -5 to 5 gives a single tick);
  # labels show absolute values so the male side does not read as negative
  scale_y_continuous(limits = c(xmi, xma), breaks = seq(xmi, xma, 2.5), labels = abs(seq(xmi, xma, 2.5))) +

  # flip the coordinates to make it like a pyramid
  coord_flip() +
  guides(fill = "none") +

  # change the theme to "economist"; it goes before the theme() tweaks,
  # because a complete theme replaces any theme() settings added earlier
  theme_economist() + scale_color_economist() +
  theme(axis.text = element_text(colour = "black")) +
  theme(plot.margin = unit(c(1, 1, 1, 1), "lines")) +

  # add header title and axis titles
  ggtitle("Age Sex Pyramid by Age and Planning Area in year 2019") +
  theme(plot.title = element_text(size = 38, face = "bold", hjust = 0.5)) +
  labs(y = "Population in % ", x = "Age Group")
## Warning: Factor `PA` contains implicit NA, consider using
## `forcats::fct_explicit_na`

Here we enhance the graph using a table-style (gtable) layout: a custom strip labelling the “Female” and “Male” sides is added alongside the facet_wrap borders and the subheaders for each town.

Section 4: Final Graph

# Construct the strip
library(grid)

strip = gTree(name = "Strip", 
   children = gList(
     rectGrob(gp = gpar(col = NA, fill = "grey85")),
     textGrob("Female", x = .75, gp = gpar(fontsize = 5, col = "grey10")), 
     textGrob("Male", x = .25, gp = gpar(fontsize = 5, col = "grey10")),
     linesGrob(x = .5, gp = gpar(col = "grey95"))))

# Position strip using annotation_custom
p1 = p + annotation_custom(strip, xmin = Inf, xmax = 21, ymax = Inf, ymin = -Inf) 

g = ggplotGrob(p1)

# The strip is positioned outside the panel,
# therefore turn off clipping to the panel.
g$layout[g$layout$name=='panel', "clip"] = "off"

# Draw it
grid.newpage()
grid.draw(g)

This graph shows the % of population in each age group within each Planning Area. Each subgraph is drawn for one Planning Area.

Section 5 Major Advantages of building R visualisation vs Tableau