Section 1:

Describe three major data and design challenges faced in accomplishing the task. (15 marks)

(1) The first major challenge was data preparation: deriving the % of population per town.
(2) I had difficulty plotting the pyramid because most of my attempts produced only the male side or only the female side of the graph. Suitable references online were hard to find. For instance, this website gave me only a "male" or "female" side plot even though I followed the steps carefully: https://klein.uk/teaching/viz/datavis-pyramids/
(3) I also ran into version incompatibilities when reusing code found online. For example, for lines like geom_bar(subset = .(Gender == "Male"), stat = "identity"), I read online that the "subset" argument is not compatible with the newest ggplot2 version.
(4) I had much difficulty setting the size of the plot so that the facet plots were less compressed and still legible.
(5) I had much difficulty ordering the age groups of the age-sex pyramid from youngest to oldest, because the labels are sorted as text, so "5-9" always comes after "45-49" and before "50-54". It was very hard to find a way to manipulate the data like this in R.

Section 2

Suggest ways to overcome the challenges

(Resolve (1)): I used a dplyr pipeline to run a series of calculations that give the % of population per town. I made several careless mistakes at first, which caused the population aged 90+ to show more than 100%. I used this code to get the result:

df <- aggregate(Pop~PA+AG+Sex+Time,pop_data,sum) %>%
   filter(Time == 2019) %>%
   group_by(PA) %>%
   mutate(PAsum=sum(Pop)) %>%
   group_by(PA) %>%
   mutate(pct_Pop=Pop*100/PAsum) %>%
   arrange(PA) %>%
   mutate(PercentageByPA= (Pop/PAsum)*100)

After that, I removed some columns for easier viewing.

(Resolve (2) and (3)): I explored this other link: https://stackoverflow.com/questions/26724458/r-how-to-add-facet-labels-for-pyramid-like-plot-in-ggplot2. The key point is that I cannot use geom_bar(subset = .(Gender == "Male"), stat = "identity"), because the "subset" argument is no longer available in the current ggplot2 according to this source: https://stackoverflow.com/questions/34588232/subset-parameter-in-layers-is-no-longer-working-with-ggplot2-2-0-0. I therefore changed my code base to pass a subsetted data frame to each layer through the "data" argument (see the code in the step-by-step guide below).

(Resolve (4)): A quick shortcut is to increase the size of the whole plot. To do this, change the knitr chunk options to:

knitr::opts_chunk$set(fig.width = 15 , fig.height = 15,echo = TRUE)

Only the "fig.width" and "fig.height" values need to change.

(Resolve (5)): To make the age groups sort correctly from youngest to oldest, I edited the labels inside the CSV file. For instance, I changed "5_to_9" to "05-9".
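An alternative that leaves the CSV untouched is to set the factor levels in R. This is a minimal sketch in base R, not the approach used in this report. The label vector is illustrative, and the sketch assumes every label starts with the lower bound of its age range:

```r
# Illustrative labels in the raw "5_to_9" style (assumed format, not the real file)
ag <- c("0_to_4", "10_to_14", "45_to_49", "5_to_9", "50_to_54", "90_and_over")

# Extract the lower bound of each range as a number
lower <- as.numeric(sub("_.*$", "", ag))

# Order the factor levels by that number instead of alphabetically
ag_sorted <- factor(ag, levels = unique(ag[order(lower)]))
levels(ag_sorted)
```

ggplot2 draws a discrete axis in factor-level order, so an AG column releveled this way would run from youngest to oldest without renaming any labels.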

The proposed design (Sketch)

image: Proposed Sketch

Section 3: Step By Step Guide

Import Packages

For this hands-on exercise, three R packages are used:

tidyverse, an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
ggthemes, an R package for creating plots with different background themes.
kableExtra, a package whose goal is to help you build common complex tables and manipulate table styles.

packages = c('tidyverse','ggthemes', 'kableExtra')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Import dataSet

The Singapore Residents by Planning Area/Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 dataset will be used. It is available at https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data

The R data frame is called "pop_data".
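The guide does not show the import call itself. The following is a minimal sketch, assuming the CSV was downloaded from the page above and saved under the hypothetical file name used here. Using read.csv with stringsAsFactors = TRUE is also an assumption, chosen because PA, AG and Sex print as factors (<fct>) in the output further down:

```r
# Hypothetical file name: replace with the path of the CSV downloaded from SingStat
csv_path <- "respopagesextod2011to2019.csv"

if (file.exists(csv_path)) {
  # stringsAsFactors = TRUE is an assumption; it keeps PA, AG and Sex as factors
  pop_data <- read.csv(csv_path, stringsAsFactors = TRUE)
  # The pipeline in this guide relies on the columns PA, AG, Sex, Pop and Time
  str(pop_data)
} else {
  message("File not found: ", csv_path)
}
```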

Preparing the data

Next, use the mutate() function of the dplyr package to derive three new measures: PAsum, pct_Pop and PercentageByPA. PAsum is the total population of each Planning Area (PA). pct_Pop is an intermediate value that is not kept in the final table. PercentageByPA is the percentage of the PA's population in each age group and sex, and it is the dependent variable in our graph. We have to make sure the data is filtered for the year 2019 only.

# create the dataset for pyramid chart
df <- aggregate(Pop~PA+AG+Sex+Time,pop_data,sum) %>%
   filter(Time == 2019) %>%                     # keep year 2019 only
   group_by(PA) %>%
   mutate(PAsum=sum(Pop)) %>%                   # total population of each PA
   mutate(pct_Pop=Pop*100/PAsum) %>%            # intermediate value, dropped below
   arrange(PA) %>%
   mutate(PercentageByPA= (Pop/PAsum)*100) %>%
   select(-Time, -Pop, -PAsum, -pct_Pop) #%>%   # remove 4 columns
   #filter(PA =="Ang Mo Kio")

df

## # A tibble: 2,090 x 4
## # Groups:   PA [55]
##    PA         AG    Sex     PercentageByPA
##    <fct>      <fct> <fct>            <dbl>
##  1 Ang Mo Kio 00-4  Females           1.62
##  2 Ang Mo Kio 05-9  Females           1.89
##  3 Ang Mo Kio 10-14 Females           2.23
##  4 Ang Mo Kio 15-19 Females           2.37
##  5 Ang Mo Kio 20-24 Females           2.67
##  6 Ang Mo Kio 25-29 Females           3.29
##  7 Ang Mo Kio 30-34 Females           3.31
##  8 Ang Mo Kio 35-39 Females           3.53
##  9 Ang Mo Kio 40-44 Females           3.83
## 10 Ang Mo Kio 45-49 Females           3.99
## # ... with 2,080 more rows
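A quick sanity check helps catch the kind of mistake that produced values above 100%: within each planning area, the percentages should add up to 100. The sketch below uses base R and made-up counts for two hypothetical areas ("Area A" and "Area B"); it illustrates the check and is not part of the original pipeline:

```r
# Toy counts for two hypothetical planning areas (made-up numbers)
toy <- data.frame(
  PA  = rep(c("Area A", "Area B"), each = 4),
  AG  = rep(c("00-4", "05-9"), times = 4),
  Sex = rep(rep(c("Females", "Males"), each = 2), times = 2),
  Pop = c(100, 150, 120, 130, 40, 60, 50, 50)
)

# Denominator: total population of each planning area
toy$PAsum <- ave(toy$Pop, toy$PA, FUN = sum)
toy$PercentageByPA <- toy$Pop / toy$PAsum * 100

# The shares of each planning area should add up to 100
totals <- tapply(toy$PercentageByPA, toy$PA, sum)
print(totals)
stopifnot(all(abs(totals - 100) < 1e-8))
```

If the denominator were taken over the wrong grouping, the final stopifnot() would fail instead of letting the error reach the plot.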

It is easier to add filter(PA == "Ang Mo Kio") before you try to plot a single age-sex pyramid. Without the filter, the rows of every planning area are drawn on the same axes and their bars stack on top of one another, so the percentages for some age groups can add up to more than 100%. Apply this filter first for testing, and remove it only when you reach facet_wrap.

Impute NA with 0

This prevents the program from throwing errors when we plot data that contains "NA" values.

df[is.na(df)] = 0
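A narrower variant replaces missing values only in the numeric column, so factor columns such as PA are left alone. This is a minimal sketch on a toy table with hypothetical areas. It also assumes that missing percentages come from areas with no residents, where 0/0 gives NaN:

```r
# Toy table mimicking df; "Area B" has zero residents (made-up numbers)
toy <- data.frame(
  PA  = factor(c("Area A", "Area A", "Area B", "Area B")),
  Sex = factor(c("Females", "Males", "Females", "Males")),
  Pop = c(30, 20, 0, 0)
)
toy$PercentageByPA <- toy$Pop / ave(toy$Pop, toy$PA, FUN = sum) * 100

# is.na() is TRUE for both NA and NaN, so this covers either case.
# Only the numeric column is touched; the factor columns stay as they are.
toy$PercentageByPA[is.na(toy$PercentageByPA)] <- 0
toy
```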

Plot the Pyramid plot

Prepare the layers. The code is explained in the comments below:

xmi <- -5
xma <- 5

# add the basic layers and plot bar graphs for both "females" and "males"
p = ggplot(data = df, aes(x = AG, fill = Sex)) +
    geom_bar(stat = "identity", data = subset(df, Sex == "Females"), aes(y = PercentageByPA)) +
    geom_text(data = subset(df, Sex == "Females"), aes(y = PercentageByPA, label = AG), 
      size = 4, hjust = -0.1) +
    geom_bar(stat = "identity", data = subset(df, Sex == "Males"), aes(y=PercentageByPA * (-1)) ) +
  
  # facet wrap to draw one pyramid per planning area, arranged in 5 rows
  facet_wrap(~ PA, nrow = 5) +
    # the Sex level is "Males"; subsetting on "Male" would return no rows
    geom_text(data = subset(df, Sex == "Males"), aes(y = PercentageByPA * (-1), label = AG), 
      size = 4, hjust = 1.1) +
  scale_y_continuous(limits = c(xmi, xma), breaks = seq(xmi, xma, 10), labels = abs(seq(xmi, xma, 10))) + 
  theme(axis.text = element_text(colour = "black")) + 
  
  # flip one of the coordinates to make it like a pyramid
  coord_flip() + 
  ylab("") + xlab("") + guides(fill = "none") +
  theme(plot.margin = unit(c(1, 1, 1, 1), "lines")) +
  
  # change the theme color to "economist"
  theme_economist() + scale_color_economist()+
  
  # add Header title and Axis titles
  ggtitle("Age Sex Pyramid by Age and Planning Area in year 2019")+
  theme(plot.title = element_text(size = 38, face = "bold",hjust = 0.5))+
  labs(y = "Population in % ", x = "Age Group", size = 15)

## Warning: Factor `PA` contains implicit NA, consider using
## `forcats::fct_explicit_na`
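The core pyramid trick can also be seen in isolation: male values are made negative so that their bars extend to the left, and the axis labels show absolute values. The sketch below is a simplified single-layer variant, not the code used for the final graph. It assumes ggplot2 is installed and uses made-up percentages for one hypothetical planning area; the column signed_pct is introduced only for illustration:

```r
library(ggplot2)

# Toy data: made-up percentages for one hypothetical planning area
toy <- data.frame(
  AG  = factor(rep(c("00-4", "05-9", "10-14"), times = 2)),
  Sex = rep(c("Females", "Males"), each = 3),
  PercentageByPA = c(1.6, 1.9, 2.2, 1.7, 2.0, 2.3)
)

# Signed value: males negative (left side), females positive (right side)
toy$signed_pct <- ifelse(toy$Sex == "Males", -toy$PercentageByPA, toy$PercentageByPA)

p_toy <- ggplot(toy, aes(x = AG, y = signed_pct, fill = Sex)) +
  geom_col() +                           # one bar layer for both sexes
  scale_y_continuous(labels = abs) +     # show positive percentages on both sides
  coord_flip() +                         # horizontal bars form the pyramid
  labs(x = "Age Group", y = "Population in %")

p_toy
```

Putting the sign into the data means a single bar layer serves both sexes, whereas the guide's version uses one subsetted layer per sex.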

Next we enhance the graph by adding a table-style strip to each facet panel that labels the "Male" and "Female" halves, alongside the facet_wrap borders and the town subheaders.

Section 4: Final Graph

# Construct the strip
library(grid)

strip = gTree(name = "Strip", 
   children = gList(
     rectGrob(gp = gpar(col = NA, fill = "grey85")),
     textGrob("Female", x = .75, gp = gpar(fontsize = 5, col = "grey10")), 
     textGrob("Male", x = .25, gp = gpar(fontsize = 5, col = "grey10")),
     linesGrob(x = .5, gp = gpar(col = "grey95"))))

# Position strip using annotation_custom
p1 = p + annotation_custom(strip, xmin = Inf, xmax = 21, ymax = Inf, ymin = -Inf) 

g = ggplotGrob(p1)

# The strip is positioned outside the panel,
# therefore turn off clipping to the panel.
g$layout[g$layout$name=='panel', "clip"] = "off"

# Draw it
grid.newpage()
grid.draw(g)

This graph shows the percentage share of each age group within each Planning Area, split by sex. Each subgraph is drawn for one Planning Area.

Places like Marina East, Pioneer and Tuas have zero population. This is not unexpected, because Marina East and Tuas are places with industrial parks and financial towers, mainly for businesses and not for living in. For Pioneer, NTU is built there and spans 200 hectares, so it is not surprising that NTU takes up the whole of Pioneer.
Punggol has a younger population: the top of its pyramid is very slim, which suggests that it has very few elderly residents. This is not surprising, as BTO flats were recently built in Punggol, targeting newly married couples, and single adults over the age of 35 are eligible to buy BTO flats as well.
The Museum Planning Area is located in the Central Area of the Central Region of Singapore. The area plays a "bridging role" between the Orchard area and the Downtown Core, which necessitates proper transport networks for vehicles, pedestrians and public transport. Because it is in the central region of Singapore, it is not surprising that some people live there, most probably the active population, since this region is generally more expensive to live in.
I plotted each age group as a percentage rather than as an actual population count because we are interested in comparing between Planning Areas. Some planning areas have more residential estates and hence larger counts, which would make comparisons unfair.
Lim Chu Kang is where Singapore's farms are located, and there are many wetland reserves there. Hence no people live in that area.

Section 5: Major Advantages of building R visualisation vs Tableau

The first major advantage is that the facet graphs can be split into rows. In Tableau, they can only be squeezed into one row with many columns. To view all the plots, users have to scroll sideways across the columns, which makes the visual less user-friendly because not all the plots can be seen at one glance.
The second major advantage is that I can use ggthemes to make the plot look like those of well-known magazines, without having to do it manually as in Tableau, for instance by finding the right colour palette and fonts.
Although the graph is tricky to code and errors lead to a lot of debugging, reaching the final product takes fewer steps. For instance, I only have to type one line of code to get facet_wrap. To display as many plots on one page in Tableau, I would have to plot five sets of about 11 graphs each and align them neatly in dashboard containers to form the 11 x 5 grid of blocks that I produced in R.
Data manipulation was also easier with dplyr in R because I only have to type one code block to get the final table. In Tableau, we have to write many calculated fields, some of which call other calculated fields, to reach the final data table. When too many calculated fields are created, it gets quite confusing, and we have to open them one by one to check the code inside. With R, I only have to comment out the lines that I do not need and print the output to see the data table produced at each step of the way.

Other references

https://klein.uk/teaching/viz/datavis-pyramids/ (did not work for this version of ggplot2)
https://stackoverflow.com/questions/26724458/r-how-to-add-facet-labels-for-pyramid-like-plot-in-ggplot2 (main reference for this example)
https://stackoverflow.com/questions/33832047/pyramid-plot-in-r
https://rmarkdown.rstudio.com/lesson-3.html
https://stackoverflow.com/questions/35458230/how-to-increase-the-font-size-of-ggtitle-in-ggplot2 (add header and change size)

AgeSexPyramid_DataVizMO8

Chan Jia Yi

3/12/2020