In this project, we want to reveal the demographic structure of Singapore population by age cohort (i.e. 0-4, 5-9,……) and by planning area in 2019.
After went through the data and requirements, I decided to use population pyramid to show the demographic structure of Singapore population by age cohort, and to use a stacked bar chart and Ternary Plot to show the demographic structure of Singapore population by planning area. Here are the drafts for the three plots.
Singapore Population Pyramid 2019
Singapore Population Structure 2019 (Stacked Bar Chart in percentage)
Singapore Population Structure 2019 (Ternary Plot)
packages = c('ggtern', 'plotly', 'tidyverse')
for (p in packages){
if(!require(p,character.only = T)){install.packages(p)
}
library(p,character.only = T)
}
In the makeover 8, Geospatial statistics will be used. Please Download the data here.The original data set is in csv format.
In the code chunk below, read_csv() of readr is used to import respopagesextod2011to2019.csv into R and parsed it into tibble R data frame format.
SG_POP_data = read_csv("data/respopagesextod2011to2019.csv")
The output tibbled data frame is called SG_POP_data.
Next, we need to rename headers of the data and replace " to " in Age as below.
## rename the headers of the data
names(SG_POP_data)[1:7] =
c("Planning_Area", "Subzone", "Age", "Gender", "Housing_Type","Population","Year")
## replace " to " to "-"
SG_POP_data$Age <- str_replace_all(SG_POP_data$Age, " to ", "-")
## filter out 0 in population
SG_POP_data <- subset(SG_POP_data,SG_POP_data$Population>0)
SG_POP_data_2019 = subset(SG_POP_data,Year == "2019")
## Show the first 5 rows in the data
head(SG_POP_data_2019)
## # A tibble: 6 x 7
## Planning_Area Subzone Age Gender Housing_Type Population Year
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Ang Mo Kio Ang Mo Kio~ 0-4 Males HDB 3-Room Flats 10 2019
## 2 Ang Mo Kio Ang Mo Kio~ 0-4 Males HDB 4-Room Flats 10 2019
## 3 Ang Mo Kio Ang Mo Kio~ 0-4 Males HDB 5-Room and E~ 20 2019
## 4 Ang Mo Kio Ang Mo Kio~ 0-4 Males Condominiums and~ 50 2019
## 5 Ang Mo Kio Ang Mo Kio~ 0-4 Femal~ HDB 3-Room Flats 10 2019
## 6 Ang Mo Kio Ang Mo Kio~ 0-4 Femal~ HDB 4-Room Flats 10 2019
SG_POP_data_PP = SG_POP_data_2019[,c("Age","Gender","Population")]
## To categorize the Age and store it as levels
SG_POP_data_PP$Age <- factor(SG_POP_data_PP$Age, levels = SG_POP_data_PP$Age,labels = SG_POP_data_PP$Age)
## aggregate the data by gender and age group
SG_POP_data_PP <- aggregate(formula = Population ~ Gender + Age, data = SG_POP_data_PP, FUN = sum)
## Changed the males population into negative value
SG_POP_data_PP$Population <- ifelse(SG_POP_data_PP$Gender == "Males", -1*SG_POP_data_PP$Population, SG_POP_data_PP$Population)
## pyramid charts are two barcharts with axes flipped
pyramidG <- ggplot(SG_POP_data_PP, aes(x = Age, y = Population, fill = Gender))+
geom_bar(data = subset(SG_POP_data_PP, Gender == "Females"), stat ="identity")+
geom_bar(data = subset(SG_POP_data_PP, Gender == "Males"), stat = "identity")+
scale_y_continuous(labels = paste0(as.character(c(seq(2, 0, -1), seq(1, 2, 1))), "m"))+
coord_flip()+
labs(title="Singapore Population Pyramid in 2019")+
theme(plot.title = element_text(hjust = .5),
axis.ticks = element_blank())+
scale_fill_brewer(palette = "Dark2")
pyramidG
4 variables will used in this graph.
SG_POP_data_PS = SG_POP_data_2019[,c("Age","Planning_Area","Subzone","Population")]
ggplot(SG_POP_data_PS, aes(fill=factor(Age), y=Population, x=Planning_Area)) +
geom_bar(position="stack", stat="identity")+
coord_flip()
In the first plot, we find out the length in stack bar chart was not the same and it is difficult for us to compare different areas. Hence, we tried to use percentage bar chart to show the result instead.
ggplot(SG_POP_data_PS, aes(fill=Age, y=Population, x=Planning_Area)) +
geom_bar(position="fill",stat="identity")+
labs(x = "Planning Area", y = "% of Population")+
geom_vline(xintercept=0.5)+
coord_flip()+
ggtitle("Population Structure by Planning Area in 2019")+
theme(plot.title = element_text(hjust = .5),
axis.ticks = element_blank())+
theme_classic()
In this plot, we fill the graph with Age, but the categories were too many to show in the plot and it is not very clear for us to compare the difference. Hence we need to create a new group to show the catagory in color.
## extract the last word in the age column
agenew=c()
for (i in SG_POP_data_PS$Age){
age_last = tail(strsplit(i,split= c("-"," "))[[1]],1)
agenew=c(agenew,age_last)
}
SG_POP_data_PS$Age_New = agenew
## Create a new group in the data, including Young(Age<=24), Economically Active(24<age<=64), Old(Age>64)
Group=c()
for (i in SG_POP_data_PS$Age_New){
if (i=='90 and over'){
i<-100
}
else{
i<-as.numeric(i)
}
if(i <= 24){
Group=c(Group,'Young')}
else if((24 < i) & (i<= 64)) {
Group=c(Group,'Economically Active')}
else {
Group=c(Group,'Old')}
}
SG_POP_data_PS$Group = Group
Group <- SG_POP_data_PS$Group
brks <- c(0, 0.25, 0.5, 0.75, 1)
ggplot(SG_POP_data_PS, aes(fill=Group, y=Population, x=Planning_Area)) +
geom_bar(position="fill",stat="identity")+
scale_y_continuous(breaks = brks, labels = scales::percent(brks)) +
labs(x = "Planning Area", y = "% of Population")+
geom_vline(xintercept=0.5)+
coord_flip()+
ggtitle("Population Structure by Economically Group in 2019")+
theme(plot.title = element_text(hjust = .5),
axis.ticks = element_blank())+
theme_minimal(base_size =10)
Group <- SG_POP_data_PS$Group
brks <- c(0, 0.25, 0.5, 0.75, 1)
ggplot(SG_POP_data_PS, aes(fill=Group, y=Population, x=Planning_Area)) +
geom_bar(position="fill",stat="identity")+
scale_y_continuous(breaks = brks, labels = scales::percent(brks)) +
labs(x = "Planning Area", y = "% of Population")+
geom_vline(xintercept=0.5)+
coord_flip()+
ggtitle("Population Structure by Economically Group in 2019")+
theme(plot.title = element_text(hjust = .5),
axis.ticks = element_blank())+
theme_minimal(base_size =10)+
scale_fill_brewer(palette = "Dark2")
data_sum = aggregate(SG_POP_data_PS$Population, by = list(SG_POP_data_PS$Planning_Area, SG_POP_data_PS$Group), FUN=sum)
data_sum = cbind(data_sum, reshape(data_sum, idvar = c("Group.1", "Group.2"), timevar="Group.2", direction="wide")[,-1])
pl = c()
total = c()
active = c()
old =c()
young = c()
data_area = unique(data_sum$Group.1)
for (i in data_area){
pl=c(pl,i)
g_value = subset(data_sum, Group.1==i, select =c('x'))
g_value2= unlist(g_value)
sum_value = g_value2[1]+g_value2[2]+g_value2[3]
total=c(total, sum_value)
active = c(active,g_value2[1])
old =c(old,g_value2[2])
young = c(young,g_value2[3])
}
df = data.frame(Planning_Area = pl, Total = total, Active = active,Old = old, Young = young)
df$P_Active = df$Active/df$Total
df$P_Young = df$Young/df$Total
df$P_Old = df$Old/df$Total
Use ggtern() function of ggtern package to create a simple ternary plot.
YOUNG = df$P_Young
ACTIVE = df$P_Active
OLD = df$P_Old
ggtern(data=df, aes(x=YOUNG,y=ACTIVE, z=OLD)) +
geom_point()
# reusable function for creating annotation object
label <- function(txt) {
list(
text = txt,
x = 0.1, y = 1,
ax = 0, ay = 0,
xref = "paper", yref = "paper",
align = "center",
font = list(family = "serif", size = 15, color = "white"),
bgcolor = "#b3b3b3", bordercolor = "black", borderwidth = 2
)
}
# reusable function for axis formatting
axis <- function(txt) {
list(
title = txt, tickformat = ".0%", tickfont = list(size = 10)
)
}
ternaryAxes <- list(
aaxis = axis("Young"),
baxis = axis("Active"),
caxis = axis("Old")
)
# Initiating a plotly visualization
plot_ly(
df,
a = ~YOUNG,
b = ~ACTIVE,
c = ~OLD,
color = I("black"),
type = "scatterternary"
) %>%
layout(
annotations = label("Ternary Markers"),
ternary = ternaryAxes
)
In the process, we may meet some challenges, hence we provide solution as below.
Challenge 1: Age is character format and then to fill Age into ggplot, it was not in the right order. e.g. 5-9 was in the middle of the picture, but it should be after 0-4.
Solution: Transfer Age to a factor and save it to levels, and then put it into ggplot and make an aggregation.
Challenge 2: It was not easy to put two bar charts together in R. I created two bar charts, one for females and another for males, but they were superimposed.
Solution: Convert the population of Males into negative values and created a fake X-axis which was sperate by 0 million.
Challenge 3: It was not easy to transfer Age into Three groups, because age is in character format and has too many categories. Especially ’90 and over’ was different from others.
Solution: At first, we converted ‘90 and over’ to 100 and extracted the last number from the Age column. After that, the Age column was in numeric format and we can use it in a for loop to create a new column ‘Group’.
Singapore has relatively balanced sex ratio and most people aged between 25 and 64.
More than 50% of population in all regions are Economically Active Group.
The region with the highest proportion of young people is western water catchment. The region with the highest proportion of Economically Active people is museum.
R is more flexible when changing the format of bar charts. E.g. when we need to change the content of a bar chart from population number to a percentage, we just changed fill in position instead of the stack.
R is more convenient to explore statistical insights, and it can provide more tools for us to analyze.
R has many plot packages or functions. E.g. we need not create an outline by ourselves when we are creating a ternary plot, R provides a function helps us to create a ternary in a quick way.
```