Name: Shan Huijie

1. Overview

The data visualization below aims to reveal the demographic structure of Singapore's population by age cohort and by planning area in 2019.

The dataset can be found at: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data

1.1 Proposed Sketch of Design

[Figure: proposed sketch of the design]

1.2 Data and Design Challenges

  1. The first challenge arose in the data preparation stage, where arrange() did not sort the age groups as expected: the ‘5_to_9’ group did not follow the ‘0_to_4’ group.
  2. The second challenge was selecting, from the original data, only the columns needed for the question being visualized, since not all columns are necessary to answer it. The original data set has 7 columns: PA (Planning Area), SZ (Subzone), AG (Age Group), Sex, TOD (Type of Dwelling), Pop (Resident Count) and Time.
  3. The third challenge was customizing the ggplot output.

1.3 How to Overcome Challenges

  1. AG is a character column, so arrange() compares the labels as text, character by character, rather than as numbers; ‘5_to_9’ therefore sorts after ‘45_to_49’. Renaming ‘5_to_9’ to ‘05_to_09’ solves the issue.
  2. To answer the question of “the demographic structure of Singapore’s population by age cohort (e.g., 0-4, 5-9, ...) and by planning area in 2019”, I chose “PA”, “AG”, “Sex” and “Pop” as the input columns.
  3. I researched online and used the ggplot2 cheat sheet to customize the plots.
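
The sorting issue in item 1 can be reproduced on a handful of labels. The sketch below is not the approach used in this report (which renames the label); it is an alternative that orders the labels by the number before the first underscore and stores that order as factor levels, so the result does not depend on how text is compared. The vector `ag` is an illustrative subset of labels, and `lower_bound` and `ag_levels` are names introduced only for this example.

```r
# Age-group labels in the style of the AG column (illustrative subset)
ag <- c("45_to_49", "5_to_9", "0_to_4", "10_to_14", "90_and_over")

# Text sort: "5_to_9" lands after "45_to_49" because "4" < "5"
sort(ag, method = "radix")

# Take the number before the first underscore as the lower age bound
lower_bound <- as.numeric(sub("_.*", "", ag))

# Use that numeric order to define the factor levels
ag_levels <- unique(ag[order(lower_bound)])
ag_factor <- factor(ag, levels = ag_levels)

levels(ag_factor)
```

With AG stored as a factor like this, sorting and the ggplot axis would both follow the level order instead of the text order, assuming every label starts with its lower age bound.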

2. Step-by-step Description

2.1 Install and load packages

packages <- c('tidyverse', 'dplyr')

# Install any package that is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

2.2 Import data set

raw_dataset <- read_csv("respopagesextod2011to2020.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   PA = col_character(),
##   SZ = col_character(),
##   AG = col_character(),
##   Sex = col_character(),
##   TOD = col_character(),
##   Pop = col_double(),
##   Time = col_double()
## )

2.3 Data Preparation for Visualizations

To sort the age groups in ascending order, we rename “5_to_9” to “05_to_09”; otherwise the “5_to_9” group is ordered after the “45_to_49” group.

raw_dataset$AG[raw_dataset$AG == "5_to_9"] <- "05_to_09"
head(raw_dataset)
## # A tibble: 6 x 7
##   PA        SZ               AG    Sex   TOD                           Pop  Time
##   <chr>     <chr>            <chr> <chr> <chr>                       <dbl> <dbl>
## 1 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 1- and 2-Room Flats         0  2011
## 2 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 3-Room Flats               10  2011
## 3 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 4-Room Flats               30  2011
## 4 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 5-Room and Executive F…    50  2011
## 5 Ang Mo K… Ang Mo Kio Town… 0_to… Males HUDC Flats (excluding thos…     0  2011
## 6 Ang Mo K… Ang Mo Kio Town… 0_to… Males Landed Properties               0  2011

2.3.1 Prepare data for pyramid plot with age group and planning area

To plot the “Age-Sex Pyramid by age group and planning area”, we first filter the original data set to the year 2019 and then select the needed columns: planning area, age group, sex and population. The summarise() function then sums the population by planning area, age group and sex.

pyramid_withPA <- raw_dataset %>% 
      filter(Time == 2019) %>%          # Time is numeric (col_double)
      select(PA, AG, Sex, Pop) %>%
      arrange(AG) %>%
      group_by(PA, AG, Sex) %>% 
      summarise(Pop = sum(Pop))
## `summarise()` has grouped output by 'PA', 'AG'. You can override using the `.groups` argument.
head(pyramid_withPA)
## # A tibble: 6 x 4
## # Groups:   PA, AG [3]
##   PA         AG       Sex       Pop
##   <chr>      <chr>    <chr>   <dbl>
## 1 Ang Mo Kio 0_to_4   Females  2660
## 2 Ang Mo Kio 0_to_4   Males    2760
## 3 Ang Mo Kio 05_to_09 Females  3110
## 4 Ang Mo Kio 05_to_09 Males    3120
## 5 Ang Mo Kio 10_to_14 Females  3670
## 6 Ang Mo Kio 10_to_14 Males    3710
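
The message printed by summarise() above says that the result is still grouped by PA and AG. As a minimal sketch, assuming an ungrouped result is wanted for later steps, the same pipeline can pass `.groups = "drop"`. The data frame `toy` is made up for illustration and only borrows the column names of the real data set.

```r
library(dplyr)

# Hypothetical toy rows with the same column names as the data set
toy <- tibble(
  PA   = c("A", "A", "A", "B"),
  AG   = c("0_to_4", "0_to_4", "0_to_4", "0_to_4"),
  Sex  = c("Males", "Males", "Females", "Males"),
  Pop  = c(10, 30, 20, 5),
  Time = c(2019, 2019, 2019, 2011)
)

toy_summary <- toy %>%
  filter(Time == 2019) %>%                       # keep the 2019 rows only
  group_by(PA, AG, Sex) %>%
  summarise(Pop = sum(Pop), .groups = "drop")    # sum and drop all grouping

toy_summary
is_grouped_df(toy_summary)
```

Dropping the groups does not change the totals; it only removes the grouping metadata and silences the message.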

2.3.2 Prepare data for overall sex-population pyramid plot

To plot the overall sex-population pyramid for Singapore in 2019, we again use filter() to keep the 2019 records and select the PA, AG, Sex and Pop columns as above. arrange() sorts the age groups in ascending order. aggregate() then splits the data into subsets and computes the total population for each combination of age group and sex.

sg_pyramid <- raw_dataset %>% 
      filter(Time == 2019) %>%          # Time is numeric (col_double)
      select(PA, AG, Sex, Pop) %>%
      arrange(AG) 
# Sum Pop over planning areas for every AG x Sex combination
sex_pop_pyramid <- aggregate(Pop ~ AG + Sex, data = sg_pyramid, FUN = sum)
head(sex_pop_pyramid)
##         AG     Sex    Pop
## 1   0_to_4 Females  90850
## 2 05_to_09 Females  97040
## 3 10_to_14 Females 102550
## 4 15_to_19 Females 108910
## 5 20_to_24 Females 122480
## 6 25_to_29 Females 145960
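
To make the formula `Pop ~ AG + Sex` concrete, here is a small sketch on made-up numbers. The left-hand side is the column being summed and the right-hand side lists the grouping columns; PA is not in the formula, so its rows are added together. The data frame `toy` is hypothetical.

```r
# Hypothetical toy data: two planning areas, two age groups
toy <- data.frame(
  PA  = c("A", "A", "B", "B", "A", "B"),
  AG  = c("0_to_4", "0_to_4", "0_to_4", "0_to_4", "10_to_14", "10_to_14"),
  Sex = c("Females", "Males", "Females", "Males", "Females", "Males"),
  Pop = c(10, 20, 30, 40, 50, 60)
)

# One row per observed AG x Sex combination, with Pop summed across PA
totals <- aggregate(Pop ~ AG + Sex, data = toy, FUN = sum)
totals
```

The grand total is preserved by the aggregation, which gives a quick sanity check: `sum(totals$Pop)` should equal `sum(toy$Pop)`.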

2.4 Plot Visualizations

2.4.1 Plot Overview of Singapore Sex-Population Pyramid

We use ggplot to plot the visualizations. geom_col() draws the bars. To show the female and male bars on one axis, aes() maps AG to the x axis and, for the y axis, checks the sex: the male population is multiplied by -1 so that the male bars extend in the opposite direction from the female bars. The fill aesthetic colors the bars by sex, which also produces the legend. coord_flip() swaps the x and y axes. scale_y_continuous() relabels the negative y values as positive (labels = abs) and sets symmetric axis limits.

pyramidplot <- ggplot(sex_pop_pyramid,
                      aes(x = AG,
                          y = ifelse(Sex == "Males", yes = -Pop, no = Pop),  # males to the left
                          fill = Sex)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  # Show absolute values on the axis and keep it symmetric around zero
  scale_y_continuous(labels = abs, limits = max(sex_pop_pyramid$Pop) * c(-1, 1)) +
  labs(
    x = "Age Group",
    y = "Population",
    title = "Overview of Singapore Sex-Population Pyramid, 2019",
    caption = "Data Source: singstat.gov.sg"
  ) +
  theme(plot.title = element_text(size = 12, face = 'bold'))

pyramidplot
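
As a variation on the plot above, the sign flip can be stored in its own column before plotting, which keeps aes() short and lets the axis limit be computed from the same values that are drawn. This is only a sketch on made-up numbers; `toy`, `signed_pop` and `axis_limit` are names introduced for the example. The limit matters because ggplot2 removes values that fall outside the `limits` of a continuous scale by default.

```r
library(ggplot2)

# Hypothetical toy data with two age groups
toy <- data.frame(
  AG  = c("0_to_4", "0_to_4", "25_to_29", "25_to_29"),
  Sex = c("Males", "Females", "Males", "Females"),
  Pop = c(90, 80, 120, 110)
)

# Negative values for males, positive values for females
toy$signed_pop <- ifelse(toy$Sex == "Males", -toy$Pop, toy$Pop)

# Largest bar in either direction, used for a symmetric axis
axis_limit <- max(abs(toy$signed_pop))

p <- ggplot(toy, aes(x = AG, y = signed_pop, fill = Sex)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = abs, limits = c(-axis_limit, axis_limit))

p
```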

2.4.2 Plot Singapore Sex-Population Pyramid by Planning Area and Age Group

labs() sets the x-axis and y-axis labels, the title and the caption. scale_fill_brewer() and theme() adjust the colors and text to make the visualization more readable. facet_wrap(~ PA) draws one pyramid per planning area.

ggplot(pyramid_withPA,
       aes(x = AG,
           y = ifelse(test = Sex == "Males", yes = -Pop, no = Pop),
           fill = Sex)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = abs,
                     limits = max(pyramid_withPA$Pop) * c(-1, 1)) +
  coord_flip() +
  labs(x = "Age Group",
       y = "Population",
       title = "Singapore Demographic Structure by Age Group and Planning Area in 2019",
       caption = "Data Source: singstat.gov.sg") +
  scale_fill_brewer(palette = "Set1") +
  # Hide tick labels on both axes to keep the small facets uncluttered
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        plot.title = element_text(size = 12, face = 'bold'),
        axis.title.x = element_text(size = 9),
        axis.title.y = element_text(size = 9)) +
  facet_wrap(~ PA)

3. Insights from final visualization

Insight 1: Singapore’s Population in 2019. The overview sex-population pyramid shows that adults aged 25 to 59 make up the largest part of Singapore’s population. Females and males are distributed almost equally.

Insight 2: Clusters. The sex-population pyramids by age group and planning area show that some planning areas have similar distributions, so they can be grouped into clusters.
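
One way to make “similar distributions” checkable, offered here as a sketch rather than as part of the analysis above, is to compare each planning area's age shares instead of its raw counts, so that large and small areas are on the same footing. The numbers below are made up, and `toy` and `share` are names introduced for the example.

```r
# Hypothetical toy data: area B is ten times smaller than area A
toy <- data.frame(
  PA  = c("A", "A", "A", "B", "B", "B"),
  AG  = rep(c("0_to_4", "25_to_29", "65_to_69"), times = 2),
  Pop = c(100, 300, 100, 10, 30, 10)
)

# Share of each age group within its own planning area
toy$share <- toy$Pop / ave(toy$Pop, toy$PA, FUN = sum)

toy
```

In this toy case the two areas differ in size but have the same share profile, which is the kind of similarity a cluster would capture.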