1. Overview

In order to explore the demographic structure of Singapore Population from different parts, we create three plots to visualize data analytics.

Population Pyramid

A graph to show distribution of population by age cohort, when the population is growing, it forms a shape of pyramid.

Ternary Plot

A graph displays as a triangle with sides scaled from 0 to 1. Every point on a ternary plot represents a different composition of the three components. A parallel to a side of the triangle is the locus of points representing systems with constant chemical composition in the component situated in the vertex opposed to the side.

Bubble Plot

A graph used to show three-dimensions data, easily help to compare a pair of variables.

2. Installing and Launching R Packages

We use the code chunk below to ensure we install and launch packages we need in the following steps.

packages = c('tidyverse','ggthemes','ggtern','plotly','viridis')

for(p in packages){library
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3. Importing and Preparing Data Set

Data sources from Singstat link, including characters of Singapore population, such as age, geographic distribution and gender.

3.1 Importing Data

Use read.csv() function to import data into R

sg <- read.csv("data/respopagesextod2011to2019.csv")

3.2 Reading Data

Check the structure of raw data using str() function

str(sg)

## 'data.frame':    883728 obs. of  7 variables:
##  $ PA  : Factor w/ 55 levels "Ang Mo Kio","Bedok",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SZ  : Factor w/ 323 levels "Admiralty","Airport Road",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ AG  : Factor w/ 19 levels "0_to_4","10_to_14",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex : Factor w/ 2 levels "Females","Males": 2 2 2 2 2 2 2 2 1 1 ...
##  $ TOD : Factor w/ 8 levels "Condominiums and Other Apartments",..: 2 3 4 5 6 7 1 8 2 3 ...
##  $ Pop : int  0 10 30 50 0 0 40 0 0 10 ...
##  $ Time: int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...

3.3 Renaming Columns and Reordering Factors

Rename those abbreviations to show the full name of variable.

names(sg)<-c("Zone","Subzone","Age","Gender","Type of Dwelling","Population","Year")
str(sg)

## 'data.frame':    883728 obs. of  7 variables:
##  $ Zone            : Factor w/ 55 levels "Ang Mo Kio","Bedok",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Subzone         : Factor w/ 323 levels "Admiralty","Airport Road",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ Age             : Factor w/ 19 levels "0_to_4","10_to_14",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : Factor w/ 2 levels "Females","Males": 2 2 2 2 2 2 2 2 1 1 ...
##  $ Type of Dwelling: Factor w/ 8 levels "Condominiums and Other Apartments",..: 2 3 4 5 6 7 1 8 2 3 ...
##  $ Population      : int  0 10 30 50 0 0 40 0 0 10 ...
##  $ Year            : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...

levels(sg$Age)

##  [1] "0_to_4"      "10_to_14"    "15_to_19"    "20_to_24"    "25_to_29"   
##  [6] "30_to_34"    "35_to_39"    "40_to_44"    "45_to_49"    "5_to_9"     
## [11] "50_to_54"    "55_to_59"    "60_to_64"    "65_to_69"    "70_to_74"   
## [16] "75_to_79"    "80_to_84"    "85_to_89"    "90_and_over"

#the order of factors in age is incorrect, reorder the level of factors

sg$Age <- factor(sg$Age,levels(sg$Age)[c(1,10,2:9,11:19)])
levels(sg$Age)

##  [1] "0_to_4"      "5_to_9"      "10_to_14"    "15_to_19"    "20_to_24"   
##  [6] "25_to_29"    "30_to_34"    "35_to_39"    "40_to_44"    "45_to_49"   
## [11] "50_to_54"    "55_to_59"    "60_to_64"    "65_to_69"    "70_to_74"   
## [16] "75_to_79"    "80_to_84"    "85_to_89"    "90_and_over"

4. Population Pyramid

4.1 Preparing Data

Because we only want to show data in 2019, so that we should use filter() function to select data of year 2019.
In the pymarid, we want to split gender variable, to compare difference between gender by age cohort, so that we should use formula to reverse the value of female population.

sg_2019 <- sg %>%
  filter (Year == 2019)
sg_2019$Population <- ifelse(sg_2019$Gender == "Females",-1*sg_2019$Population,sg_2019$Population)

4.2 Creating Plot

Create two bar plots to show different distributions of gender.

Reset the labels of y axis, to make the label value of female population is positive and same intervals of two bar plots.
Use coord_flip() function to reverse the y axis, and design the layout and color to make plot more pretty.
Use theme_economist to make the plot backgroung has more fun.

age_cohort <- ggplot(sg_2019,aes(x = Age, y = Population,fill = Gender))+
  geom_bar(data = subset(sg_2019,Gender == "Females"), stat = "identity")+
      geom_bar(data = subset(sg_2019,Gender == "Males"), stat = "identity") +
  scale_y_continuous(breaks = seq(-150000, 150000, 50000), 
                     labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))))) +
  coord_flip()

age_cohort+
  ggtitle("Singapore Population Pyramid by Age Cohort, 2019")+
            xlab("Age Group")+
            ylab("Population in Thousands")+
  scale_fill_manual(values=c('lightpink2','steelblue3'))+
  theme_economist()+
  theme(legend.position='right')

For groups aged below 24, the populations of male are higher than that of female. But for groups aged above 24, the opposite is happening，especially for the aged group, the proportion of female is obviously higher than that of male.In the 95 and over group, the figure of female is around 3 times of male.

5. Ternary Plot

5.1 Preparing Data

We use three age groups, young, economically active and old as three components of ternary plot.

Use spread() function to calculate population in different age groups.
Use mutate to create new age groups. The rows number is order according to the new table after spreading.
Filter function to keep only year 2019 of data.

Age_pop <- sg %>%
  mutate(`Year` = as.character(Year))%>%
  spread(Age, Population)
Age_pop[is.na(Age_pop)]<-0
  
Age_population <- Age_pop%>%
  mutate(Young = rowSums(.[6:10]))%>%
  mutate(Economically_Active = rowSums(.[11:18]))  %>%
  mutate(Old = rowSums(.[19:24])) %>%
  mutate(TOTAL = rowSums(.[6:24])) %>%
  filter(Year == 2019)%>%
  filter(TOTAL > 0)

5.2 Creating Plot

ggtern is a package designed by R to easier create a ternary plot.

Use yound, economically active and old group as x, y and z variable respectively.
Use theme_rgbw to set the background.
Use hjust=0.5 to make the title in the center.

ternplot<- ggtern(data=Age_population, aes(x=Young,y=Economically_Active, z=Old)) +
  geom_point()
ternplot+
  ggtitle("Ternary Plot of Population Structure, 2019")+
            xlab("Young")+
            ylab("Active")+
  zlab("Old")+
  theme_rgbw()+
  theme(plot.title = element_text(hjust=0.5))

For majority of planning areas, economically active group always make up the largest percentage. But compare to part of younger area, has more points fall in the older area. Overall, the tendency of data samples to be more aging.

6. Bubble Plot

6.1 Preparing Data

Proportion of young population and proportion of old population are the pair of variables we want to compare in the bubble plot.

Use group_by() function to calcurate variables in zone dimension.
Use summarise() function to create variable calculated by variables that group by zone.
We should create a new table, select the variables we need in the summarise result table as variables of new table. This step is to make sure the variables we want to show in the bubble plot already existed in the R environment.

group_zone <- Age_population %>%
  group_by(Zone)

summary <- group_zone %>%
  summarise(total_young=sum(Young),total_old=sum(Old),total=sum(TOTAL),Young_per=total_young/total,Old_per=total_old/total) %>%
  select(Zone,total_young,total_old,Young_per,Old_per,total,total)

6.2 Creating Bubble Plot

Bubble plot could be created in R by using scatter plot with shape 21.

Reset the color of points to viridis palette.
Rename the title, x and y labels.
Because zone has too many values, set guide=FALSE to hide the legend.
Use range=c() to set the point size, easily to compare points.

bubbleplot<-ggplot(summary,aes(x=Old_per,y=Young_per,size=total,fill=Zone))+
  geom_point(alpha=0.5,shape=21,color="black")

bubbleplot +
  scale_x_continuous(labels=scales::percent)+
  scale_y_continuous(labels=scales::percent)+
  scale_fill_viridis(discrete=TRUE, guide=FALSE, option="A") +
  scale_size_continuous(name = "Population",range=c(0.1,12))+
  ggtitle("Bubble Plot of Comparison between Young and Old Population, 2019")+
  xlab("Proportion of Old")+
  ylab("Proportion of Young")+
  theme(legend.position="right")

More bubbles have higher proportion of aged population. The values of active group seem always be the biggest compare to that of young group and old group.

Demographic Structure of Singapore Population

Jiang Linlin

8 March 2020