DataViz Makeover 8

Data Preparation

The data used in this exercise is the Singapore Residents by Planning Area/Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 data series. It is available digitally here

This code chunk installs the tidyverse packages if they are not already installed and loads them into the RStudio environment.

packages <- c('tidyverse')

for(p in packages){
  # install the package only if it cannot be loaded
  if(!require(p, character.only=TRUE)){
    install.packages(p)
  }
  library(p,character.only=TRUE)
}
#{r echo=TRUE,eval=TRUE, message=FALSE,warning=FALSE}
#echo controls whether the code is shown in the knitted output. eval controls whether the code runs or not

We will be using some themes from the ggthemes package. Before you load this package, please ensure that you have installed it.

library(ggthemes)

Read the file

read <- read_csv("data/respopagesextod2011to2019.csv")

Filter the file by year 2019 and then remove the Time column

O1 <- read %>% filter(Time == 2019) %>% select(-Time)
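As a quick check of what this step does, the sketch below applies the same filter and select calls to a small made-up table. The tibble `toy` and its values are invented for illustration; only the column names `PA`, `Pop` and `Time` follow the dataset.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- tibble(
  PA   = c("A", "A", "B", "B"),
  Pop  = c(10, 12, 20, 25),
  Time = c(2018, 2019, 2018, 2019)
)

# Keep only the 2019 rows, then drop the now-constant Time column
toy2019 <- toy %>%
  filter(Time == 2019) %>%
  select(-Time)

print(toy2019)
```

Only the two 2019 rows remain, and the Time column is gone.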

3 major data and design challenges faced in accomplishing the task

1st Challenge:

Distributions for the smaller PAs cannot be seen, because many different PAs have to be compared together using the scale of the largest PA.

Putting all the bar charts in one set of facets on a single scale facilitates comparison of total population size by age group across PAs, since Males and Females are aggregated together. However, the distribution across the age groups is almost impossible to make out for the PAs with low populations.

A population pyramid for each PA, however, introduces another dimension, the male-to-female ratio, which is important in describing populations. This further reduces the space available for each visualisation. In addition, with 55 facets each containing one PA, comparing across PAs becomes even more difficult.

2nd Challenge:

Being new to data manipulation, I faced difficulty in removing unnecessary stratification from the dataset. In addition, the age group 5 to 9 appeared in the middle of the population pyramid instead of near the bottom.

3rd Challenge:

Population pyramids and bar charts across many facets facilitate comparison of absolute population size. However, the demographic distribution of PAs with low populations is not easy to view, because some PAs have very high populations, which makes the ones with low populations look like they have almost no people at all. One option is to construct the population pyramids based on the % distribution across age groups. The flaw of using the % distribution to normalise for population size is that PAs with very low populations, such as Museum, Lim Chu Kang and Seletar, have very uneven distributions. Comparing these population pyramids by proportion results in very odd shapes, because so few people live in those areas that certain age ranges are not represented at all.

Small population size, resulting in no representation in certain age groups

Contrast this with the distribution for PAs with larger populations
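The effect of normalising a very small population can be sketched with a few lines of base R. The counts below are invented for illustration and are not taken from the dataset; `pct` is a hypothetical helper name.

```r
# Hypothetical counts per age group for a very small and a larger area
small <- c("0_to_4" = 0, "5_to_9" = 10, "10_to_14" = 0, "15_to_19" = 30)
large <- c("0_to_4" = 9000, "5_to_9" = 10000, "10_to_14" = 11000, "15_to_19" = 10000)

# Percentage share of each age group within its own area
pct <- function(x) round(100 * x / sum(x), 1)

pct(small)  # shares jump between 0 and 75
pct(large)  # shares stay between 22.5 and 27.5
```

With only 40 people in the small area, a single empty age group produces a gap in the pyramid, while the larger area keeps a smooth shape.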

2. Suggest ways to overcome the Challenges identified earlier. Sketch out the proposed design

Solution to 1st Challenge and 3rd Challenge:

To facilitate comparison across PAs with different population sizes, PAs in different ranges of population size will be displayed on different graphs. As PAs of around the same population size are grouped together on an appropriately labelled scale, readers can see the demographic distribution of each PA easily.

Solution to 2nd challenge:

Original, before aggregating the data

The function aggregate is used to combine rows that share the same values in the specified columns. Here it sums Pop for each combination of PA and AG.

optionA<- aggregate(Pop~PA+AG,O1,sum)
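To see what the formula interface of aggregate does, here is a minimal sketch on a made-up data frame. The rows and values are invented for illustration; the column names follow the dataset.

```r
# Hypothetical rows: one PA, two age groups, split by sex
toy <- data.frame(
  PA  = c("A", "A", "A", "A"),
  AG  = c("0_to_4", "0_to_4", "5_to_9", "5_to_9"),
  Sex = c("Males", "Females", "Males", "Females"),
  Pop = c(10, 20, 30, 40)
)

# Sum Pop for each PA and AG combination; columns not named in the
# formula (here Sex) are collapsed
toy_agg <- aggregate(Pop ~ PA + AG, data = toy, FUN = sum)
print(toy_agg)
```

The four rows collapse into two, one per age group, with the male and female counts added together.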

The function group_by groups the data by a given field so that subsequent calculations are done within each group. Here, grouping by PA (Planning Area) lets mutate compute the total population of each PA in this dataset.

optionA<- aggregate(Pop~PA+AG,O1,sum) %>%group_by(PA) %>% mutate(PAsum=sum(Pop))

To get the percentage of people in each age group in a certain PA

optionA<- aggregate(Pop~PA+AG,O1,sum) %>%group_by(PA) %>% mutate(PAsum=sum(Pop))%>%group_by(PA,AG) %>%
   mutate(pct_Pop=Pop*100/PAsum)

To get the population of a PA as a proportion of the population of the whole of Singapore

optionA<- aggregate(Pop~PA+AG,O1,sum) %>%group_by(PA) %>% mutate(PAsum=sum(Pop))%>%group_by(PA,AG) %>%
   mutate(pct_Pop=Pop*100/PAsum)%>%group_by()%>%mutate(PAperc=PAsum/sum(Pop))
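The grouped calculations can be checked on a small made-up table. This is a sketch under assumptions: the values are invented, and `ungroup()` is used here to drop the grouping before the overall total is computed.

```r
library(dplyr)

# Hypothetical aggregated table: two PAs with two age groups each
toy <- tibble(
  PA  = c("A", "A", "B", "B"),
  AG  = c("0_to_4", "5_to_9", "0_to_4", "5_to_9"),
  Pop = c(30, 70, 100, 300)
)

toy_pct <- toy %>%
  group_by(PA) %>%
  mutate(PAsum   = sum(Pop),              # total population of each PA
         pct_Pop = Pop * 100 / PAsum) %>% # share of each age group within its PA
  ungroup() %>%
  mutate(PAperc = PAsum / sum(Pop))       # PA total as a fraction of the overall total

print(toy_pct)
```

Within each PA the pct_Pop values add up to 100, and PAperc is a fraction rather than a percentage (0.2 for A and 0.8 for B in this toy table).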

Original, before ordering the ages. You can see that the age group 5_to_9 is in the centre of the population pyramid

Solution: the ages have to be ordered manually, with the youngest at the bottom and the oldest at the top. We thus have to assign levels to order the factor, as seen in the code below.

popGH$AG <- factor(popGH$AG,levels = c("0_to_4","5_to_9","10_to_14","15_to_19","20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59","60_to_64","65_to_69","70_to_74","75_to_79","80_to_84","85_to_89","90_and_over"))
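Why the manual levels are needed can be seen in a small base R sketch. The three labels are a subset chosen for illustration.

```r
ages <- c("10_to_14", "0_to_4", "5_to_9")

# As plain text the labels are compared character by character,
# so "5_to_9" ends up after "10_to_14"
sort(ages)

# With explicit levels the labels sort in age order
ages_f <- factor(ages, levels = c("0_to_4", "5_to_9", "10_to_14"))
as.character(sort(ages_f))
```

ggplot2 places a character column on a discrete axis in the same text order, which is why 5_to_9 drifts away from the bottom of the pyramid until the levels are set.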

Sketch of proposed visualisation

Step-by-step description on how the data visualization was prepared

Step 1

Using the group_by function shared earlier, work out the total Male and Female populations.

Summarise creates a new column called count which contains the population values.

Spread splits the values in Sex, which are Females and Males, into separate columns, making them into headers.

The second argument in spread assigns the population values to the respective female and male columns.

OptinoAperc<- O1%>% group_by(Sex, AG, PA) %>%
  summarise(count=sum(Pop)) %>%
  spread(Sex, count)  %>% group_by(PA) %>%  mutate(FemPAsum=sum(Females),MalPAsum=sum(Males))%>%arrange(PA)
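The reshaping done by summarise and spread can be sketched on a made-up table. The values are invented, and `.groups = "drop"` is an addition here to silence the regrouping message.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table: one row per PA, age group and sex
toy <- tibble(
  PA  = c("A", "A", "A", "A"),
  AG  = c("0_to_4", "0_to_4", "5_to_9", "5_to_9"),
  Sex = c("Females", "Males", "Females", "Males"),
  Pop = c(20, 10, 40, 30)
)

toy_wide <- toy %>%
  group_by(Sex, AG, PA) %>%
  summarise(count = sum(Pop), .groups = "drop") %>%
  spread(Sex, count)   # Females and Males become column headers

print(toy_wide)
```

Each age group now occupies one row with a Females and a Males column. In recent versions of tidyr, spread is superseded by pivot_wider, which could be used in its place.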

Step 2

The gather function is used to merge the columns Females and Males back into the Sex column.

The ifelse function can be read as: if Sex is equal to Males, multiply the value by -1, else retain the original value.

popGH<-gather(OptinoAperc,Sex,Pop,-MalPAsum,-FemPAsum,-AG,-PA) %>% group_by(PA)%>%mutate(sumPA=sum(Pop))
popGH$Pop <- ifelse(popGH$Sex == "Males", -1*popGH$Pop, popGH$Pop)
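A minimal sketch of the gather and sign-flip steps on a made-up wide table follows. The values are invented for illustration.

```r
library(tidyr)

# Hypothetical wide table with one column per sex
toy_wide <- data.frame(
  PA      = c("A", "A"),
  AG      = c("0_to_4", "5_to_9"),
  Females = c(20, 40),
  Males   = c(10, 30)
)

# Stack the Females and Males columns back into Sex / Pop pairs
toy_long <- gather(toy_wide, Sex, Pop, -PA, -AG)

# Negate the male counts so their bars extend to the left of zero
toy_long$Pop <- ifelse(toy_long$Sex == "Males", -1 * toy_long$Pop, toy_long$Pop)

print(toy_long)
```

The male rows now carry negative values, which is what places them on the opposite side of the axis in the pyramid.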

Step 3

The axis has to be ordered, and thus the factor levels have to be ordered manually; otherwise the age group 5_to_9 would be out of place.

popGH$AG <- factor(popGH$AG,levels = c("0_to_4","5_to_9","10_to_14","15_to_19","20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59","60_to_64","65_to_69","70_to_74","75_to_79","80_to_84","85_to_89","90_and_over"))

Step 4

Split the PAs by population size to facilitate better comparison. Do this by using the function filter.

popGH0<-popGH%>%filter(sumPA<500)

popGH1<-popGH%>%filter(sumPA>=500,sumPA<5000)

popGH2<-popGH%>%filter(sumPA>=5000,sumPA<75000)

popGH3<-popGH%>%filter(sumPA>=75000,sumPA<150000)

popGH4<-popGH%>%filter(sumPA>=150000,sumPA<200000)

popGH5<-popGH%>%filter(sumPA>=200000)
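As an alternative sketch, the same six bands could be assigned in one step with base R's cut. The totals and the band labels below are invented for illustration.

```r
# Hypothetical PA totals, one value per band used above
sumPA <- c(120, 3000, 40000, 100000, 180000, 250000)

# right = FALSE makes each band include its lower bound and exclude its
# upper bound, matching the >= and < comparisons in the filter calls
band <- cut(sumPA,
            breaks = c(0, 500, 5000, 75000, 150000, 200000, Inf),
            labels = c("<500", "500-5k", "5k-75k", "75k-150k", "150k-200k", ">=200k"),
            right  = FALSE)

data.frame(sumPA, band)
```

A single band column like this could then be used to split or filter the data, instead of keeping six separate objects.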

Step 5

Visualise the population pyramids. Annotations are given as comments on the relevant lines of the code.

ggplot(popGH0, aes(x = AG, y = Pop, fill = Sex)) + 
  geom_bar(data = subset(popGH0, Sex == "Females"), stat = "identity") +
  geom_bar(data = subset(popGH0, Sex == "Males"), stat = "identity") + 
  coord_flip()+facet_wrap(~ PA)+  # Flip axes and draw one panel per PA
  labs(title="PA with population size of less than 500",y="Population", x = "Age Group") +
  scale_y_continuous(breaks = seq(-50, 50, 25), 
                     labels = c('50','25','0','25','50'))+ # set the axis intervals
  scale_fill_brewer(palette = "Set1")+    # set the color of the fill of the population pyramid
  theme_bw(base_size = 8)+                # reduce the size of the font so that the population scales are visible
  theme(plot.title = element_text(hjust = .5), # Centre plot title; placed after theme_bw() so it is not overridden
        axis.ticks = element_blank())     # remove the axis ticks

ggplot(popGH1, aes(x = AG, y = Pop, fill = Sex)) + 
  geom_bar(data = subset(popGH1, Sex == "Females"), stat = "identity") +
  geom_bar(data = subset(popGH1, Sex == "Males"), stat = "identity") + 
  coord_flip()+facet_wrap(~ PA)+  # Flip axes and draw one panel per PA
  labs(title="PA with population size between 500 and 5k",y="Population", x = "Age Group") +
  scale_y_continuous(breaks = seq(-250, 250, 50), 
                     labels = c('250','200','150','100','50','0','50','100','150','200','250'))+ # set the axis intervals
  scale_fill_brewer(palette = "Set1")+    # set the color of the fill of the population pyramid
  theme_bw(base_size = 8)+                # reduce the size of the font so that the population scales are visible
  theme(plot.title = element_text(hjust = .5), # Centre plot title; placed after theme_bw() so it is not overridden
        axis.ticks = element_blank())     # remove the axis ticks

ggplot(popGH2, aes(x = AG, y = Pop, fill = Sex)) + 
  geom_bar(data = subset(popGH2, Sex == "Females"), stat = "identity") +
  geom_bar(data = subset(popGH2, Sex == "Males"), stat = "identity") + 
  coord_flip()+facet_wrap(~ PA)+  # Flip axes and draw one panel per PA
  labs(title="PA with population size between 5k and 75k",y="Population", x = "Age Group") +
  scale_y_continuous(breaks = seq(-2500, 2500, 500), 
                     labels = c('2.5k','2k','1.5k','1k','0.5k','0','0.5k','1k','1.5k','2k','2.5k'))+ # set the axis intervals
  scale_fill_brewer(palette = "Set1")+    # set the color of the fill of the population pyramid
  theme_bw(base_size = 8)+                # reduce the size of the font so that the population scales are visible
  theme(plot.title = element_text(hjust = .5), # Centre plot title; placed after theme_bw() so it is not overridden
        axis.ticks = element_blank())     # remove the axis ticks

ggplot(popGH3, aes(x = AG, y = Pop, fill = Sex)) + 
  geom_bar(data = subset(popGH3, Sex == "Females"), stat = "identity") +
  geom_bar(data = subset(popGH3, Sex == "Males"), stat = "identity") + 
  coord_flip()+facet_wrap(~ PA)+  # Flip axes and draw one panel per PA
  labs(title="PA with population size between 75k and 150k",y="Population", x = "Age Group") +
  scale_y_continuous(breaks = seq(-7500, 7500, 2500), 
                     labels = c('7.5k','5k','2.5k','0','2.5k','5k','7.5k'))+ # set the axis intervals
  scale_fill_brewer(palette = "Set1")+    # set the color of the fill of the population pyramid
  theme_bw(base_size = 8)+                # reduce the size of the font so that the population scales are visible
  theme(plot.title = element_text(hjust = .5), # Centre plot title; placed after theme_bw() so it is not overridden
        axis.ticks = element_blank())     # remove the axis ticks

ggplot(popGH4, aes(x = AG, y = Pop, fill = Sex)) + 
  geom_bar(data = subset(popGH4, Sex == "Females"), stat = "identity") +
  geom_bar(data = subset(popGH4, Sex == "Males"), stat = "identity") + 
  coord_flip()+facet_wrap(~ PA)+  # Flip axes and draw one panel per PA
  labs(title="PA with population size between 150k and 200k",y="Population", x = "Age Group") +
  scale_y_continuous(breaks = seq(-15000, 15000, 5000), 
                     labels = c('15k','10k','5k','0','5k','10k','15k'))+ # set the axis intervals
  scale_fill_brewer(palette = "Set1")+    # set the color of the fill of the population pyramid
  theme_bw(base_size = 8)+                # reduce the size of the font so that the population scales are visible
  theme(plot.title = element_text(hjust = .5), # Centre plot title; placed after theme_bw() so it is not overridden
        axis.ticks = element_blank())     # remove the axis ticks

ggplot(popGH5, aes(x = AG, y = Pop, fill = Sex)) + 
  geom_bar(data = subset(popGH5, Sex == "Females"), stat = "identity") +
  geom_bar(data = subset(popGH5, Sex == "Males"), stat = "identity") + 
  coord_flip()+facet_wrap(~ PA,ncol=3)+  # Flip axes and draw one panel per PA, three per row
  labs(title="PA with population size of more than 200k",y="Population", x = "Age Group") + 
  scale_y_continuous(breaks = seq(-15000, 15000, 5000), 
                     labels = c('15k','10k','5k','0','5k','10k','15k'))+ # set the axis intervals
  scale_fill_brewer(palette = "Set1")+    # set the color of the fill of the population pyramid
  theme_bw(base_size = 8)+                # reduce the size of the font so that the population scales are visible
  theme(plot.title = element_text(hjust = .5), # Centre plot title; placed after theme_bw() so it is not overridden
        axis.ticks = element_blank())     # remove the axis ticks
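The six calls above differ only in the data, the title and the axis breaks, so they could be wrapped in one function. The sketch below is one possible way to do that, not the code used for the final figures: `plot_pyramid`, `limit` and `step` are hypothetical names, geom_col stands in for the two geom_bar layers, and the toy data is invented.

```r
library(ggplot2)

# Hypothetical helper: draws a faceted population pyramid for one band.
# `limit` and `step` are illustrative parameters for the axis range and spacing.
plot_pyramid <- function(data, title, limit, step, ncol = NULL) {
  brks <- seq(-limit, limit, step)
  ggplot(data, aes(x = AG, y = Pop, fill = Sex)) +
    geom_col() +                            # bars drawn from the values in Pop
    coord_flip() +                          # age groups on the vertical axis
    facet_wrap(~ PA, ncol = ncol) +         # one panel per PA
    labs(title = title, y = "Population", x = "Age Group") +
    scale_y_continuous(breaks = brks,
                       labels = as.character(abs(brks))) +  # show magnitudes on both sides
    scale_fill_brewer(palette = "Set1") +
    theme_bw(base_size = 8) +               # complete theme first, so the tweaks below are kept
    theme(plot.title = element_text(hjust = 0.5),
          axis.ticks = element_blank())
}

# Hypothetical toy data with male counts already negated
toy <- data.frame(
  PA  = "A",
  AG  = factor(c("0_to_4", "5_to_9", "0_to_4", "5_to_9"),
               levels = c("0_to_4", "5_to_9")),
  Sex = c("Females", "Females", "Males", "Males"),
  Pop = c(20, 40, -10, -30)
)

p <- plot_pyramid(toy, "Toy pyramid", limit = 50, step = 25)
```

Printing `p` draws the plot. A call such as `plot_pyramid(popGH0, "PA with population size of less than 500", limit = 50, step = 25)` would then stand in for the first block above, with the axis labels generated from the breaks rather than typed by hand.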

Final data visualisation

Useful information revealed by data visualisation

Finding 1

Yishun, Choa Chu Kang, Serangoon, Hougang, Pasir Ris and Tampines have bimodal distributions, with disproportionately higher populations (modes) in the age groups of 20 to 29 and 55 to 64. This could mean that a large percentage of grown-up children are still living with their parents, or close to them, in these estates.

Finding 2

Punggol and Sengkang have a disproportionately high number of 0 to 4 year olds and people in their 30s. Sengkang appears to have a higher % of older people living there compared to Punggol. This is interesting because both are actually mature estates, and Sengkang only started housing development slightly earlier than Punggol. It could be indicative of certain government policies which shape this distribution.

Finding 3

Marine Parade and Novena are the only two PAs which possess a very distinct mode at the age of 45 to 49. Those aged 50 and above form a triangle, which is indicative of a rapidly ageing population. This points to a disproportionately larger number of older people living in those areas. With the upcoming silver wave, the wave would likely peak after 15 years. From now to 30 years' time, these areas would likely see the highest number of deaths, with the replacement rate not being able to keep up with the mortality rate. In 20 years' time, Marine Parade and Novena could be the site for many new housing developments.

Highlight at least three major advantages of building the data visualisation in R as compared to using Tableau

1st Advantage: R Markdown allows for better storytelling. Instead of copying and pasting screenshots, one can show the output of the code directly. As explanatory text can be shown next to the code, annotation and explanation are much easier than pasting a screenshot and then inserting circles to highlight what you are referring to. R coupled with Shiny does not lose to Tableau, as it also allows interactivity with the user. Tableau files also take up a lot of space, whereas an R Markdown file is much smaller.

2nd Advantage: Data visualisation in R is more intuitive once one knows how to manipulate the data to show what one wants it to show. The aggregate function is far more user friendly than pivoting the data in Tableau. One can also rename the levels of certain attributes directly in R, whereas in Tableau you need to create a new grouping and then rename some of the members into different categories.

3rd Advantage: Calculated fields in Tableau are difficult to produce and manipulate, as the sum, count and average functions need to be at the same aggregation level before they can be combined. Using the group_by and mutate functions in R, it is easier to manipulate data without running into errors and to make it reflect whatever I want. I often waste a lot of time trying to figure out why some calculated field formulas don't work in Tableau, where a calculation could be running across the pane, across the field or across some unexpected attribute without my knowing. I would also have to keep one worksheet showing the output of the calculated fields to ensure that what I am doing is correct. In R, I can easily inspect my output by clicking on the objects saved in the global environment.

4th Advantage: Easier to learn. There are more informative articles on how to troubleshoot errors in R than in Tableau. I found hardly any examples of how a formula is supposed to be used in Tableau. In R, I can call ?function to get a description of how to use a certain function, along with some solid examples of how to use it.

5th Advantage: The level of customisation in R is endless. There are plenty of packages in R that can be used to plot all sorts of graphs.