1.0 Overview

In this project, we want to reveal the demographic structure of Singapore population by age cohort (i.e. 0-4, 5-9,……) and by planning area in 2019.

1.1 Proposed Design

After went through the data and requirements, I decided to use population pyramid to show the demographic structure of Singapore population by age cohort, and to use a stacked bar chart and Ternary Plot to show the demographic structure of Singapore population by planning area. Here are the drafts for the three plots.

Singapore Population Pyramid 2019

Singapore Population Pyramid 2019

Singapore Population Structure 2019 (Stacked Bar Chart in percentage)

Singapore Population Structure 2019 (Stacked Bar Chart in percentage)

Singapore Population Structure 2019 (Ternary Plot)

Singapore Population Structure 2019 (Ternary Plot)

2.0 Installing and Launching R Packages

packages = c('ggtern', 'plotly', 'tidyverse')
for (p in packages){
  if(!require(p,character.only = T)){install.packages(p)
  }
  library(p,character.only = T)
}

3.0 Importing and Preparing The Data Set

In the makeover 8, Geospatial statistics will be used. Please Download the data here.The original data set is in csv format.

3.1 Importing the data set

In the code chunk below, read_csv() of readr is used to import respopagesextod2011to2019.csv into R and parsed it into tibble R data frame format.

SG_POP_data = read_csv("data/respopagesextod2011to2019.csv")

The output tibbled data frame is called SG_POP_data.

3.2 Preparing the data

Next, we need to rename headers of the data and replace " to " in Age as below.

## rename the headers of the data 
names(SG_POP_data)[1:7] =
  c("Planning_Area", "Subzone", "Age", "Gender", "Housing_Type","Population","Year")

## replace " to " to "-"
SG_POP_data$Age <- str_replace_all(SG_POP_data$Age, " to ", "-")

## filter out 0 in population 
SG_POP_data <- subset(SG_POP_data,SG_POP_data$Population>0)

3.3 Extracting the data in 2019

SG_POP_data_2019 = subset(SG_POP_data,Year == "2019")

## Show the first 5 rows in the data 
head(SG_POP_data_2019)
## # A tibble: 6 x 7
##   Planning_Area Subzone     Age   Gender Housing_Type      Population  Year
##   <chr>         <chr>       <chr> <chr>  <chr>                  <dbl> <dbl>
## 1 Ang Mo Kio    Ang Mo Kio~ 0-4   Males  HDB 3-Room Flats          10  2019
## 2 Ang Mo Kio    Ang Mo Kio~ 0-4   Males  HDB 4-Room Flats          10  2019
## 3 Ang Mo Kio    Ang Mo Kio~ 0-4   Males  HDB 5-Room and E~         20  2019
## 4 Ang Mo Kio    Ang Mo Kio~ 0-4   Males  Condominiums and~         50  2019
## 5 Ang Mo Kio    Ang Mo Kio~ 0-4   Femal~ HDB 3-Room Flats          10  2019
## 6 Ang Mo Kio    Ang Mo Kio~ 0-4   Femal~ HDB 4-Room Flats          10  2019

4.0 Population Pyramid in 2019

4.1 Extracting the columns for Population Pyramid

SG_POP_data_PP = SG_POP_data_2019[,c("Age","Gender","Population")]

4.2 Data Preparing before drawing plot

## To categorize the Age and store it as levels
SG_POP_data_PP$Age <- factor(SG_POP_data_PP$Age, levels = SG_POP_data_PP$Age,labels = SG_POP_data_PP$Age)

## aggregate the data by gender and age group
SG_POP_data_PP <- aggregate(formula = Population ~ Gender + Age, data = SG_POP_data_PP, FUN = sum)

## Changed the males population into negative value  
SG_POP_data_PP$Population <- ifelse(SG_POP_data_PP$Gender == "Males", -1*SG_POP_data_PP$Population, SG_POP_data_PP$Population)

4.3 Drawing the graph for Population Pyramid

## pyramid charts are two barcharts with axes flipped
pyramidG <- ggplot(SG_POP_data_PP, aes(x = Age, y = Population, fill = Gender))+
  geom_bar(data = subset(SG_POP_data_PP, Gender == "Females"), stat ="identity")+
  geom_bar(data = subset(SG_POP_data_PP, Gender == "Males"), stat = "identity")+ 
  scale_y_continuous(labels = paste0(as.character(c(seq(2, 0, -1), seq(1, 2, 1))), "m"))+
  coord_flip()+
  labs(title="Singapore Population Pyramid in 2019")+
  theme(plot.title = element_text(hjust = .5),
                axis.ticks = element_blank())+
  scale_fill_brewer(palette = "Dark2")
pyramidG

5.0 Population Structure by Planning Area

5.1 Extracting the colunmns

4 variables will used in this graph.

SG_POP_data_PS = SG_POP_data_2019[,c("Age","Planning_Area","Subzone","Population")]

5.2 Creating the first but simple stacked bar chart by planning area

ggplot(SG_POP_data_PS, aes(fill=factor(Age), y=Population, x=Planning_Area)) + 
    geom_bar(position="stack", stat="identity")+
  coord_flip()

In the first plot, we find out the length in stack bar chart was not the same and it is difficult for us to compare different areas. Hence, we tried to use percentage bar chart to show the result instead.

5.3 Creating a simple stacked bar chart by planning area

ggplot(SG_POP_data_PS, aes(fill=Age, y=Population, x=Planning_Area)) + 
    geom_bar(position="fill",stat="identity")+
  labs(x = "Planning Area", y = "% of Population")+
  geom_vline(xintercept=0.5)+
  coord_flip()+
  ggtitle("Population Structure by Planning Area in 2019")+
  theme(plot.title = element_text(hjust = .5),
                axis.ticks = element_blank())+
  theme_classic()

In this plot, we fill the graph with Age, but the categories were too many to show in the plot and it is not very clear for us to compare the difference. Hence we need to create a new group to show the catagory in color.

5.4 Creating a group with “Young”, “Economically Active” and “old”

## extract the last word in the age column
agenew=c()

for (i in SG_POP_data_PS$Age){
   age_last = tail(strsplit(i,split= c("-"," "))[[1]],1)
   agenew=c(agenew,age_last)
}

SG_POP_data_PS$Age_New = agenew

## Create a new group in the data, including Young(Age<=24), Economically Active(24<age<=64), Old(Age>64)
    
Group=c()

for (i in SG_POP_data_PS$Age_New){
  if (i=='90 and over'){
    i<-100
  }
  else{
     i<-as.numeric(i) 
  }
  if(i <= 24){
    Group=c(Group,'Young')}
  else if((24 < i) & (i<= 64)) {
    Group=c(Group,'Economically Active')}
  else {
    Group=c(Group,'Old')}
}

SG_POP_data_PS$Group = Group

5.5 Creating a percentage bar chart by new group

Group <- SG_POP_data_PS$Group
brks <- c(0, 0.25, 0.5, 0.75, 1)
ggplot(SG_POP_data_PS, aes(fill=Group, y=Population, x=Planning_Area)) + 
    geom_bar(position="fill",stat="identity")+
  scale_y_continuous(breaks = brks, labels = scales::percent(brks)) +
  labs(x = "Planning Area", y = "% of Population")+
  geom_vline(xintercept=0.5)+
  coord_flip()+
  ggtitle("Population Structure by Economically Group in 2019")+
  theme(plot.title = element_text(hjust = .5),
                axis.ticks = element_blank())+
  theme_minimal(base_size =10)

5.6 Creating a improved percentage bar chart by new color

Group <- SG_POP_data_PS$Group
brks <- c(0, 0.25, 0.5, 0.75, 1)
ggplot(SG_POP_data_PS, aes(fill=Group, y=Population, x=Planning_Area)) +
    geom_bar(position="fill",stat="identity")+
  scale_y_continuous(breaks = brks, labels = scales::percent(brks)) +
  labs(x = "Planning Area", y = "% of Population")+
  geom_vline(xintercept=0.5)+
  coord_flip()+
  ggtitle("Population Structure by Economically Group in 2019")+
  theme(plot.title = element_text(hjust = .5),
                axis.ticks = element_blank())+
  theme_minimal(base_size =10)+
  scale_fill_brewer(palette = "Dark2")

6.0 Population Structure by Planning Area

6.1 Transfer numbers into different columns.

data_sum = aggregate(SG_POP_data_PS$Population, by = list(SG_POP_data_PS$Planning_Area, SG_POP_data_PS$Group), FUN=sum)

data_sum = cbind(data_sum, reshape(data_sum, idvar = c("Group.1", "Group.2"), timevar="Group.2", direction="wide")[,-1])

pl = c()
total = c()
active = c()
old =c()
young = c()

data_area = unique(data_sum$Group.1)

for (i in data_area){
  pl=c(pl,i)
  g_value = subset(data_sum, Group.1==i, select =c('x'))
  g_value2= unlist(g_value)
  sum_value = g_value2[1]+g_value2[2]+g_value2[3]
  total=c(total, sum_value)
  active = c(active,g_value2[1])
  old =c(old,g_value2[2])
  young = c(young,g_value2[3])
}

df = data.frame(Planning_Area = pl, Total = total, Active = active,Old = old, Young = young)

df$P_Active = df$Active/df$Total
df$P_Young = df$Young/df$Total
df$P_Old = df$Old/df$Total

6.2 Plotting a static ternary diagram

Use ggtern() function of ggtern package to create a simple ternary plot.

YOUNG = df$P_Young
ACTIVE = df$P_Active
OLD = df$P_Old

ggtern(data=df, aes(x=YOUNG,y=ACTIVE, z=OLD)) +
  geom_point()

6.3 Plotting an interative ternary diagram

# reusable function for creating annotation object
label <- function(txt) {
  list(
    text = txt, 
    x = 0.1, y = 1,
    ax = 0, ay = 0,
    xref = "paper", yref = "paper", 
    align = "center",
    font = list(family = "serif", size = 15, color = "white"),
    bgcolor = "#b3b3b3", bordercolor = "black", borderwidth = 2
  )
}

# reusable function for axis formatting
axis <- function(txt) {
  list(
    title = txt, tickformat = ".0%", tickfont = list(size = 10)
  )
}

ternaryAxes <- list(
  aaxis = axis("Young"), 
  baxis = axis("Active"), 
  caxis = axis("Old")
)

# Initiating a plotly visualization 
plot_ly(
  df, 
  a = ~YOUNG, 
  b = ~ACTIVE, 
  c = ~OLD, 
  color = I("black"), 
  type = "scatterternary"
) %>%
  layout(
    annotations = label("Ternary Markers"), 
    ternary = ternaryAxes
  )

7.Conclusion

7.1 Challenges and Solutions

In the process, we may meet some challenges, hence we provide solution as below.

Challenge 1: Age is character format and then to fill Age into ggplot, it was not in the right order. e.g. 5-9 was in the middle of the picture, but it should be after 0-4.

Solution: Transfer Age to a factor and save it to levels, and then put it into ggplot and make an aggregation.

Challenge 2: It was not easy to put two bar charts together in R. I created two bar charts, one for females and another for males, but they were superimposed.

Solution: Convert the population of Males into negative values and created a fake X-axis which was sperate by 0 million.

Challenge 3: It was not easy to transfer Age into Three groups, because age is in character format and has too many categories. Especially ’90 and over’ was different from others.

Solution: At first, we converted ‘90 and over’ to 100 and extracted the last number from the Age column. After that, the Age column was in numeric format and we can use it in a for loop to create a new column ‘Group’.

7.2 Information Revealed by Proposed Data Visualisation

  1. Singapore has relatively balanced sex ratio and most people aged between 25 and 64.

  2. More than 50% of population in all regions are Economically Active Group.

  3. The region with the highest proportion of young people is western water catchment. The region with the highest proportion of Economically Active people is museum.

7.3 Information Revealed by Proposed Data Visualisation

  1. R is more flexible when changing the format of bar charts. E.g. when we need to change the content of a bar chart from population number to a percentage, we just changed fill in position instead of the stack.

  2. R is more convenient to explore statistical insights, and it can provide more tools for us to analyze.

  3. R has many plot packages or functions. E.g. we need not create an outline by ourselves when we are creating a ternary plot, R provides a function helps us to create a ternary in a quick way.

```